[Paper Description] Learning Depth Estimation for Transparent and Mirror Surfaces (ICCV 2023)

1. Brief description of the paper

1. First author: Alex Costanzino

2. Year of publication: 2023

3. Published at: ICCV (conference)

4. Keywords: depth perception, stereo matching, deep learning, segmentation, transparent objects, mirrors

5. Motivation to explore: Transparent or mirror (ToM) surfaces are everywhere, from glass windows in buildings to reflective surfaces on cars and appliances. They pose a daunting challenge for autonomous agents that leverage computer vision to operate in unknown environments. Among the many tasks involved in spatial artificial intelligence, accurately estimating depth on these surfaces remains an open problem for computer vision algorithms and deep networks. Depth sensing techniques based on deep learning, such as monocular or stereo networks, have the potential to address this challenge when provided with sufficient training data. However, datasets featuring transparent objects rarely provide ground-truth depth annotations, which must be obtained through very intensive human intervention, graphics engines, or the availability of CAD models for the ToM objects.

  1. This difficulty arises because ToM surfaces introduce misleading visual information about scene geometry, which makes depth estimation challenging not only for computer vision systems but even for humans – e.g., we might fail to notice a glass door in front of us due to its transparency.
  2. On the one hand, the definition of depth itself might appear ambiguous in such cases: is depth the distance to the scene behind the glass door or to the door itself?
  3. On the other hand, as humans can deal with this through experience, depth sensing techniques based on deep learning, e.g., monocular or stereo networks, hold the potential to address this challenge given sufficient training data.
  4. Unfortunately, very few datasets featuring transparent objects provide ground-truth depth annotations, as these must be obtained through very intensive human intervention, graphics engines, or the availability of CAD models for the ToM objects.

6. Work goal: Accurately sensing the presence (and depth) of ToM objects is an open challenge for both sensing technologies and deep learning frameworks.

7. Core idea: This paper proposes a simple and effective strategy for obtaining training data, thereby greatly improving the accuracy of the learning-based depth estimation framework for processing ToM surfaces.

  1. We propose a simple yet very effective strategy to deal with ToM objects. We trick a monocular depth estimation network by replacing ToM objects with virtually textured ones, inducing it to hallucinate their depths.
  2. We introduce a processing pipeline for fine-tuning a monocular depth estimation network to deal with ToM objects. Our pipeline exploits the network itself to generate virtual depth annotations and requires only segmentation masks delineating ToM objects – either human-made or predicted by other networks – thus getting rid of the need for any depth annotations.
  3. We show how our strategy can be extended to other depth estimation settings, such as stereo matching.

8. Experimental results:

Our experiments on the Booster dataset prove how monocular and stereo networks dramatically improve their prediction on ToM objects after being fine-tuned according to our methodology.

9. Paper and code download:

https://openaccess.thecvf.com/content/ICCV2023/papers/Costanzino_Learning_Depth_Estimation_for_Transparent_and_Mirror_Surfaces_ICCV_2023_paper.pdf


2. Implementation process

1. Implementation ideas

By replacing ToM objects with similarly shaped but textured surrogates, monocular models can be tricked into estimating the depth of opaque objects ideally placed at the same location in the scene. The method can be implemented by delineating ToM objects, masking them out of the image – via either manual annotation or a segmentation network – and then painting virtual textures inside the masked areas. On the one hand, since detecting the ToM objects is crucial to our approach, manual labeling will undoubtedly yield the most accurate selection, although it requires significant annotation cost. On the other hand, relying on a segmentation network mitigates this cost: some initial human annotation is needed for training, but afterwards large numbers of images can be segmented for free. Unfortunately, the overall effectiveness of our approach is then inevitably bounded by the accuracy of the trained segmentation model. Nevertheless, we believe that annotating images with segmentation masks requires far less effort than annotating depth. Therefore, we decided to explore both of the approaches mentioned above.

Readers may think that, given this intuition, training deep networks to handle ToM objects is unnecessary – it would suffice to segment and inpaint these objects at deployment time before estimating depth. We counter that such an approach would depend heavily on the actual accuracy of the model trained to segment ToM objects, which does not generalize well, and would add a non-negligible computational cost, namely the inference of a second network. Instead, an offline training or fine-tuning process allows leveraging human annotations where available, enables the trained network to learn how to correctly estimate depth on ToM surfaces without the second network, and opens the door to tailored strategies for other depth estimation frameworks, such as deep stereo networks. Our experiments will highlight that the former strategy is ineffective, whereas we achieve substantial improvements in accuracy by fine-tuning deep models with our approach.

In the remaining sections, we describe our method for dealing with ToM objects. Given an image dataset I, the pipeline, depicted in the figure, consists of three steps: i) surface labeling, ii) inpainting and distillation, and iii) fine-tuning of the deep network on virtual labels. Furthermore, we show how to adapt it to fine-tune deep stereo matching networks.

Surface labeling. For any image Ik ∈ I, we generate a segmentation mask Mk that classifies each pixel p as belonging to a ToM surface or not, marking it as 1 or 0 respectively. This segmentation mask can be obtained either through manual annotation or from a segmentation network Θ, i.e., Mk = Θ(Ik).
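The surface-labeling step can be sketched as follows, with `theta_logits` standing in for the raw output of a hypothetical segmentation network Θ (the paper only requires that each pixel of Mk end up as 1 for ToM and 0 otherwise; the thresholding here is an illustrative choice):

```python
import numpy as np

def tom_mask(theta_logits: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Binarize per-pixel segmentation logits into a mask M_k (1 = ToM, 0 = background)."""
    return (theta_logits >= threshold).astype(np.uint8)

# Toy logits for a 2x2 image: positive values are classified as ToM.
logits = np.array([[1.5, -0.3],
                   [0.2, -2.0]])
mask = tom_mask(logits)
# mask == [[1, 0], [1, 0]]
```

A manually annotated mask can be used interchangeably, as long as it follows the same 1/0 convention.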

Inpainting and distillation. Given an image Ik and its corresponding segmentation mask Mk, an augmented image I~k is generated by applying an inpainting operation that replaces the pixels belonging to ToM objects with a color c:

I~k = Mk · c + (1 − Mk) · Ik

Then, I~k is fed to the monocular depth network Ψ to obtain the virtual depth D~k for image Ik. The color c is randomly sampled for each frame. However, depending on the image content, certain colors may be ineffective and even increase the ambiguity of the scene – for example, inpainting with white pixels a transparent object standing in front of a white wall. To prevent this, a set of N colors ci, i ∈ [0, N−1] is sampled, and Ik is inpainted with each of them to generate a set of N augmented images I~ki. The final virtual depth D~k is then obtained by computing the per-pixel median over the N depth maps.
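The inpainting-and-distillation step can be sketched as below, with `mono_net` as a stand-in for the monocular depth network Ψ; the function names and the uniform color sampling are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def inpaint(image: np.ndarray, mask: np.ndarray, color: np.ndarray) -> np.ndarray:
    """Replace ToM pixels (mask == 1) with a flat color c."""
    out = image.copy()
    out[mask == 1] = color
    return out

def distill_virtual_depth(image, mask, mono_net, n_colors=5):
    """Run the depth network on N inpainted copies and take the per-pixel median."""
    depths = []
    for _ in range(n_colors):
        c = rng.integers(0, 256, size=3)          # random RGB color c_i
        depths.append(mono_net(inpaint(image, mask, c)))
    return np.median(np.stack(depths), axis=0)    # virtual depth D~_k
```

In practice `mono_net` would be a pretrained monocular model; the median aggregation is what makes single unlucky color choices (e.g., white on white) harmless.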

As shown, in some cases the inpainting color may be similar to the background – for example, a transparent object may disappear when a single gray mask is used – whereas with the aggregated multi-color inpainting it remains visible.

Fine-tuning on virtual labels. The steps outlined so far allow labeling dataset I with virtual depth labels that are not affected by the ambiguity of ToM objects. Our newly annotated dataset can then be used to train or fine-tune a depth estimation network so that it robustly handles the difficult objects described above. Specifically, during training the original image Ik is fed to the network, and its prediction Dk is optimized with respect to the virtual ground-truth D~k obtained from the inpainted images. This simple pipeline significantly improves the accuracy of monocular depth estimation networks on ToM objects.
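The fine-tuning objective can be sketched as a plain per-pixel L1 term against the distilled labels; the exact loss used in the paper may differ, so treat this as an illustrative choice:

```python
import numpy as np

def virtual_label_loss(pred_depth: np.ndarray, virtual_depth: np.ndarray) -> float:
    """Mean per-pixel L1 between the network's prediction on the ORIGINAL
    image I_k and the virtual depth D~_k distilled from inpainted copies."""
    return float(np.abs(pred_depth - virtual_depth).mean())
```

The key point is that the network never sees the inpainted images at training time: they exist only to produce the labels, so at deployment no segmentation or inpainting is needed.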

Extension to deep stereo matching. The pipeline can be adapted to fine-tune deep stereo models as well, as shown in the figure.

Once again, even state-of-the-art stereo architectures with otherwise excellent generalization capabilities struggle with ToM objects, since matching pixels belonging to non-Lambertian surfaces is inherently ambiguous. Therefore, a monocular depth estimation network is used to obtain virtual depth annotations for these objects only. Given a dataset S consisting of stereo pairs (Lk, Rk), virtual depth labels D~k are extracted from Lk and triangulated into virtual disparities d~k according to the parameters of the stereo rig. Then, a base disparity map dk is predicted by feeding (Lk, Rk) to the stereo network to be fine-tuned. Finally, the disparity values of ToM objects in dk are replaced with the rescaled virtual disparities according to Mk, which this time is generated on Lk. This merge operation is defined as:

d∗k = Mk · (αk · d~k + βk) + (1 − Mk) · dk

where αk and βk are scale and shift factors. Since monocular predictions are known only up to an unknown scale and shift, αk and βk are estimated by least-squares (LSE) regression of d~k against dk over the pixels not belonging to any ToM object, i.e., where Mk(p) = 0:

(αk, βk) = argmin Σ_{p : Mk(p)=0} (αk · d~k(p) + βk − dk(p))²
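The scale/shift alignment and the merge can be sketched together as follows, assuming the virtual disparities from the monocular branch and the base stereo prediction are given as same-shaped arrays (function and variable names are illustrative):

```python
import numpy as np

def align_and_merge(d_virtual: np.ndarray, d_stereo: np.ndarray, mask: np.ndarray):
    """Fit alpha, beta on non-ToM pixels (mask == 0) by least squares,
    then splice the rescaled virtual disparities into ToM regions (mask == 1)."""
    v = d_virtual[mask == 0].ravel()
    s = d_stereo[mask == 0].ravel()
    A = np.stack([v, np.ones_like(v)], axis=1)            # [d~_k(p), 1]
    (alpha, beta), *_ = np.linalg.lstsq(A, s, rcond=None)  # LSE regression
    merged = np.where(mask == 1, alpha * d_virtual + beta, d_stereo)
    return merged, alpha, beta
```

Because the fit uses only non-ToM pixels, the unreliable stereo disparities inside ToM regions never influence the alignment; they are simply overwritten in the merged map.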


Origin blog.csdn.net/qq_43307074/article/details/132101669