[Jiajiaguai Literature Sharing] MVFusion: Multi-View 3D Object Detection with Semantic-Aligned Radar and Camera Fusion

Title: MVFusion: Multi-View 3D Object Detection with Semantic-aligned Radar and Camera Fusion

Authors: Zizhang Wu, Guilian Chen, Yuanzhu Gan, Lei Wang, Jian Pu

Source: 2023 IEEE International Conference on Robotics and Automation (ICRA 2023)

This is the second article shared by Jiajiaguai

Abstract

Multi-view radar-camera fusion for three-dimensional object detection provides a longer detection range and more useful features for autonomous driving, especially under adverse weather conditions. Current radar-camera fusion methods offer a variety of designs for fusing radar information with camera data. However, these fusion methods usually adopt direct concatenation between the multi-modal features, ignoring the semantic consistency of the radar features and sufficient correlation between the modalities. In this paper, we propose MVFusion, a novel multi-view radar-camera fusion method that achieves semantically aligned radar features and enhances cross-modal information interaction. To this end, we inject semantic alignment into the radar features via the Semantically Aligned Radar Encoder (SARE) to produce image-guided radar features. We then propose the Radar Guided Fusion Transformer (RGFT), which fuses the radar and image features to strengthen the correlation between the two modalities at a global scale through a cross-attention mechanism. Extensive experiments show that MVFusion achieves state-of-the-art performance (51.7% NDS and 45.3% mAP) on the nuScenes dataset. We will release our code and trained networks upon publication.
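To make the abstract's distinction concrete, here is a minimal sketch (not the authors' code) contrasting the two fusion styles it mentions: direct channel concatenation versus cross-attention fusion. The feature shapes and dimensions are illustrative assumptions.

```python
# Sketch: direct concatenation vs. cross-attention fusion of radar and image features.
import torch
import torch.nn as nn

B, N, C = 2, 1024, 256                      # batch, tokens (H*W), channels (assumed)
img_feat = torch.randn(B, N, C)             # image features from the image encoder
radar_feat = torch.randn(B, N, C)           # radar features projected to image space

# 1) Direct concatenation, as in many prior fusion methods: no explicit
#    cross-modal interaction, just channel stacking plus a linear projection.
concat_fusion = nn.Linear(2 * C, C)
fused_concat = concat_fusion(torch.cat([img_feat, radar_feat], dim=-1))

# 2) Cross-attention fusion in the spirit of MVFusion's RGFT: every image token
#    can attend to every radar token, modelling global cross-modal correlation.
cross_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
fused_attn, _ = cross_attn(query=img_feat, key=radar_feat, value=radar_feat)

print(fused_concat.shape, fused_attn.shape)  # both (B, N, C)
```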

Figure 1. Detection comparison between a camera-based method [13] and our MVFusion. (a) Image and radar input; the color of each radar point indicates its distance from the radar. (b) 3D detection ground truth. (c) Results of the camera-based approach [13], which fails to detect distant cars and nearby pedestrians. (d) Our method leverages semantically aligned radar information for sufficient radar-camera fusion and successfully detects the missed cars and pedestrians.

Figure 2. Overview of our proposed MVFusion, which mainly consists of five parts: the radar preprocessing module, the image encoder, the Semantically Aligned Radar Encoder (SARE), the Radar Guided Fusion Transformer (RGFT), and the detection network. SARE injects semantic alignment into the radar features, while RGFT fuses the radar and image features, aiming to fully promote the interaction of the two modalities from a global perspective. The multi-view radar representation follows [15].
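The five-part pipeline in Figure 2 can be summarized as the following forward-pass skeleton. This is a sketch, not the released implementation; all module classes below are hypothetical stand-ins for the components named in the caption.

```python
# Sketch of the MVFusion pipeline from Figure 2 (hypothetical module interfaces).
import torch.nn as nn

class MVFusionSketch(nn.Module):
    def __init__(self, radar_preprocess, image_encoder, sare, rgft, det_head):
        super().__init__()
        self.radar_preprocess = radar_preprocess  # multi-view radar representation, cf. [15]
        self.image_encoder = image_encoder        # multi-view image backbone
        self.sare = sare                          # Semantically Aligned Radar Encoder
        self.rgft = rgft                          # Radar Guided Fusion Transformer
        self.det_head = det_head                  # 3D detection network

    def forward(self, images, radar_points):
        radar_maps = self.radar_preprocess(radar_points)   # project radar onto multi-view maps
        img_feats = self.image_encoder(images)              # per-view image features
        radar_feats = self.sare(radar_maps, img_feats)      # image-guided radar features
        fused_feats = self.rgft(radar_feats, img_feats)     # global cross-attention fusion
        return self.det_head(fused_feats)                   # 3D boxes and classes
```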

Figure 3. Structure of the Radar Feature Extractor (RFE), which includes residual feature convolution blocks for the sparse radar features.
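A minimal sketch of what a residual feature convolution block for the RFE could look like; the kernel sizes, normalization layers, and channel counts are assumptions, not the paper's exact configuration.

```python
# Sketch of a residual feature convolution block for sparse radar feature maps.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Conv-BN-ReLU x2 with an identity shortcut, applied to radar feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity shortcut helps preserve the few non-zero responses of sparse radar maps.
        return self.act(x + self.body(x))

# Usage example: y = ResidualConvBlock(64)(torch.randn(2, 64, 56, 100))
```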
Figure 4. Overview of the Image-Guided Radar Transformer (IGRT). IGRT assigns learnable positional encodings to the radar features to further enhance spatial information through a multi-head self-attention mechanism.
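The caption describes the two ingredients of IGRT: learnable positional encodings and multi-head self-attention over the radar features. The sketch below shows one plausible arrangement; the token count, dimensions, and residual/normalization placement are assumptions.

```python
# Sketch of the IGRT idea: learnable positions + self-attention over radar tokens.
import torch
import torch.nn as nn

class RadarSelfAttentionSketch(nn.Module):
    def __init__(self, num_tokens: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))  # learnable positional encoding
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, radar_tokens):                  # (B, num_tokens, dim)
        x = radar_tokens + self.pos_embed             # inject spatial information
        out, _ = self.attn(x, x, x)                   # self-attention over radar tokens
        return self.norm(radar_tokens + out)          # residual connection + normalization
```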
Figure 5. Overview of the Radar Guided Fusion Transformer (RGFT). RGFT fuses high-level radar and image features to achieve full cross-modal correlation under a cross-attention mechanism.
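A sketch of an RGFT-style fusion block: one modality's tokens attend to the other's via cross-attention, followed by a feed-forward layer. The exact query/key/value assignment and layer layout are assumptions (the paper ablates these choices in Table 5).

```python
# Sketch of a cross-attention fusion block in the spirit of RGFT.
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, ffn_ratio: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_ratio * dim), nn.GELU(), nn.Linear(ffn_ratio * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_tokens, radar_tokens):      # both (B, N, dim)
        # Assumed arrangement: image tokens as queries, radar tokens as keys/values.
        attn_out, _ = self.cross_attn(query=img_tokens, key=radar_tokens, value=radar_tokens)
        x = self.norm1(img_tokens + attn_out)         # residual over the image stream
        return self.norm2(x + self.ffn(x))            # fused features for the detection head
```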
Figure 6. Comparison of surround-view detection results between our method and the previous method [13]. Yellow circles mark the results of our method and blue circles mark those of [13]. Our method correctly detects objects across different viewing angles, where sufficient radar-camera interaction between the semantically aligned radar features and the visual features provides more useful cues for 3D detection.
Table 1. Comparison with single-frame state-of-the-art works using different modalities on the nuScenes test set. † denotes use of the V2-99 [43] backbone pre-trained with DD3D [42].
Table 2. Comparison with single-frame state-of-the-art works using different backbones and modalities on the nuScenes val set. † indicates the V2-99 [43] backbone pre-trained with DD3D [42].
Table 3. Ablation study of the proposed components on the nuScenes val set. "SARE" denotes the Semantically Aligned Radar Encoder; "RGFT" denotes the Radar Guided Fusion Transformer.
Table 4. Ablation study of the Semantically Aligned Radar Encoder (SARE) on the val set. "SI" denotes the semantic indicator; "IGRT" denotes the Image-Guided Radar Transformer.

Table 5. Ablation study of the Radar Guided Fusion Transformer (RGFT) on the val set. "w" means "with" and "w/o" means "without". "Q", "K", "V" denote query, key, and value. "Img." denotes image. "Concat." denotes concatenation.

Conclusion

This paper presents MVFusion, a novel multi-view radar-camera fusion method for 3D object detection that achieves semantically aligned radar features and robust cross-modal information interaction. Specifically, we propose the Semantically Aligned Radar Encoder (SARE) to extract image-guided radar features. After extracting the radar features, we propose the Radar Guided Fusion Transformer (RGFT) to fuse the enhanced radar features with high-level image features. Extensive experiments on the nuScenes dataset verify that our model achieves state-of-the-art performance among single-frame radar-camera fusion methods. In the future, we will exploit spatio-temporal information from multi-view cameras to further promote radar-camera fusion.


Reprinted from: blog.csdn.net/iii66yy/article/details/132254454