Depth completion algorithm CompletionFormer has been open-sourced and achieves state-of-the-art results

《CompletionFormer: Depth Completion with Convolutions and Vision Transformers 》

Abstract

Given a sparse depth map and the corresponding RGB image, depth completion aims to spatially propagate the sparse measurements throughout the image to obtain a dense depth prediction. Although deep-learning-based depth completion methods have made great progress, the locality of convolutional layers or graph models makes it difficult for networks to model long-range relationships between pixels. Recent fully Transformer-based architectures have reported promising results thanks to their global receptive field, yet performance and efficiency gaps with well-developed CNN models remain because their local feature details deteriorate. This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples a convolutional attention layer and a Vision Transformer into one block, used as the basic unit to build the depth completion model in a pyramid structure. This hybrid architecture naturally benefits from both the local connectivity of convolutions and the global context of Transformers in a single model. As a result, CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, while achieving significantly higher efficiency (nearly 1/3 of the FLOPs) compared with pure Transformer-based methods.

Framework

Given a sparse depth map and the corresponding RGB image, a U-Net backbone built from JCAT blocks performs depth and image information interaction at multiple scales. Features from different stages are fused at full resolution and used for the initial prediction. Finally, a spatial propagation network (SPN) refines the result.
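As a rough illustration of this pipeline, the following PyTorch-style sketch shows how the stages could fit together. The module layout, channel sizes, and the simple convolutions standing in for the JCAT U-Net backbone and the SPN are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CompletionFormerSketch(nn.Module):
    """Minimal sketch of the CompletionFormer pipeline; not the authors' code."""

    def __init__(self, feat_ch: int = 64):
        super().__init__()
        # Shallow stems embed RGB and sparse depth before the shared backbone.
        self.rgb_stem = nn.Conv2d(3, feat_ch // 2, 3, padding=1)
        self.depth_stem = nn.Conv2d(1, feat_ch // 2, 3, padding=1)
        # Stand-in for the JCAT U-Net backbone, which exchanges depth/image
        # information at multiple scales and fuses features at full resolution.
        self.backbone = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.init_head = nn.Conv2d(feat_ch, 1, 3, padding=1)  # initial dense depth
        # Stand-in for the spatial propagation network (SPN) refinement stage.
        self.spn = nn.Conv2d(feat_ch + 2, 1, 3, padding=1)

    def forward(self, rgb, sparse_depth):
        x = torch.cat([self.rgb_stem(rgb), self.depth_stem(sparse_depth)], dim=1)
        fused = self.backbone(x)            # full-resolution fused features
        init_depth = self.init_head(fused)  # initial prediction
        # Refinement conditioned on the fused features, the initial prediction,
        # and the sparse measurements (collapsed here into a single conv).
        return self.spn(torch.cat([fused, init_depth, sparse_depth], dim=1))


if __name__ == "__main__":
    model = CompletionFormerSketch()
    rgb = torch.randn(1, 3, 128, 160)
    sparse = torch.randn(1, 1, 128, 160)
    print(model(rgb, sparse).shape)  # torch.Size([1, 1, 128, 160])
```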


Convolution and Vision Transformer architecture

(a) The multi-path Transformer block of MPViT. (b) The CMT block of CMT-S. (c) Our proposed JCAT block, which contains two parallel streams: a convolutional attention layer and a Transformer layer. (d) A variant of our proposed block with a cascaded connection.
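A minimal sketch of the two-stream idea in design (c) is given below: a convolutional-attention branch runs in parallel with a standard self-attention branch, and the two streams are concatenated and fused. All layer choices (depthwise conv plus SE-style channel attention, `nn.MultiheadAttention`, the MLP widths) are assumptions for illustration, not the layers used in the paper.

```python
import torch
import torch.nn as nn

class JCATBlockSketch(nn.Module):
    """Two parallel streams (convolutional attention + Transformer layer),
    fused back into one tensor. Illustrative only; layer choices are assumed."""

    def __init__(self, ch: int = 64, heads: int = 4):
        super().__init__()
        # --- Convolutional attention stream (local details) ---
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),   # depthwise conv
            nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True),
        )
        self.channel_attn = nn.Sequential(                # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )
        # --- Transformer stream (global context) ---
        self.norm = nn.LayerNorm(ch)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(ch, 4 * ch), nn.GELU(), nn.Linear(4 * ch, ch))
        # --- Fusion of the two streams ---
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Local stream: convolution modulated by channel attention.
        local = self.conv(x) * self.channel_attn(x)
        # Global stream: one token per spatial position, self-attention + MLP.
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        t = self.norm(tokens)
        t = tokens + self.attn(t, t, t, need_weights=False)[0]
        t = t + self.mlp(self.norm(t))
        global_feat = t.transpose(1, 2).reshape(b, c, h, w)
        # Concatenate both streams and fuse back to the input width.
        return self.fuse(torch.cat([local, global_feat], dim=1))
```

Under the same assumptions, the cascaded variant in (d) would chain the Transformer stream after the convolutional stream instead of running the two in parallel.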


Quantitative and qualitative comparison under different numbers of LiDAR scan lines



Conclusion

This paper proposes CompletionFormer, a single-branch depth completion network that seamlessly integrates convolutional attention and Transformer layers into one block. Extensive ablation studies demonstrate the effectiveness and efficiency of the model for depth completion, even when the input depth is very sparse. This novel design produces state-of-the-art results on both indoor and outdoor datasets. Currently, CompletionFormer runs at around 10 FPS; further reducing its running time to meet real-time requirements is left as future work.


Source: blog.csdn.net/CSS360/article/details/132158259