14. CAPE: Camera View Position Embedding for Multi-View 3D Object Detection (Notes)


CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

CVPR 2023

Article structure:

  • Summary

  • 1. Introduction

  • 2. Related work

    • 2.1 DETR-based 2D detection
    • 2.2 Monocular 3D detection
    • 2.3 Multi-view 3D detection
    • 2.4 View Transformation
  • 3. Our approach

  • 4. Experiment

    • 4.1 Dataset
    • 4.2 Implementation Details
    • 4.3 Comparison with state-of-the-art
    • 4.4 Ablation studies
    • 4.5 Visualization
    • 4.6 Robustness analysis
  • 5. Conclusion

  • References

1. What problem is addressed (Abstract)

(Introduction) To ease the difficulty of the view transformation from 2D images to the global space, a simple and effective method based on local view position embedding, called CAPE (CAmera view Position Embedding), is proposed.

(Abstract) Current multi-view 3D object detection methods are query-based and rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space.

(Abstract) Problem: directly interacting 2D image features with the global 3D PE may increase the difficulty of learning the view transformation, because the camera extrinsics vary from view to view.
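To make the issue concrete, below is a minimal sketch in the spirit of a PETR-style global PE; the helper names (`frustum_points`, `global_3d_points`) and the depth-bin lifting are illustrative assumptions, not code from the paper. Because `T_cam2global` enters the computation, the same pixel yields different 3D points, and hence a different embedding, whenever the extrinsics change; this is exactly the variation the network must absorb.

```python
import torch

def frustum_points(H, W, depth_bins):
    """Lift every pixel (u, v) along D candidate depths -> (H*W, D, 3)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 1, 3)
    return uv1 * depth_bins.reshape(1, -1, 1)  # (u*d, v*d, d) per depth bin

def global_3d_points(K, T_cam2global, H, W, depth_bins):
    """Pixel frustum -> camera frame (via K^-1) -> global frame (via extrinsics).

    A PE computed from this output depends on T_cam2global, so it changes
    for the very same pixel whenever the camera extrinsics change.
    """
    pts_cam = frustum_points(H, W, depth_bins) @ torch.linalg.inv(K).T
    R, t = T_cam2global[:3, :3], T_cam2global[:3, 3]
    return pts_cam @ R.T + t
```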

2. Solution (Summary)

Addresses the detection of 3D objects from multi-view images, and in particular the 3D position embedding problem in query-based 3D object detection methods.

Application field: autonomous driving.

3. Innovation/Contribution (Introduction)

  • (Introduction) A new multi-view 3D detection method, CAPE (CAmera view Position Embedding), is proposed. It builds the position embedding per camera view, which eliminates the variation in view transformation caused by different camera extrinsics.

    (Abstract: the 3D position embedding is formed in the local camera-view coordinate system instead of the global coordinate system, so the 3D position embedding does not need to encode the camera extrinsics.)

  • CAPE is further generalized to temporal modeling by leveraging the object queries of previous frames and explicitly exploiting ego-motion to improve 3D object detection and velocity estimation (see the sketch after this list).

  • Extensive experiments on the nuScenes dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
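For the temporal extension, the sketch below shows one way the previous frame's object queries could be aligned to the current frame with ego-motion before fusion; the transform `T_prev2curr` and the small MLP fusion head are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def align_prev_reference_points(ref_prev, T_prev2curr):
    """Warp previous-frame 3D reference points into the current ego frame.

    ref_prev: (N, 3); T_prev2curr: (4, 4) ego-motion transform. Compensating
    ego-motion explicitly makes static objects (near) stationary across
    frames, which eases velocity estimation.
    """
    R, t = T_prev2curr[:3, :3], T_prev2curr[:3, 3]
    return ref_prev @ R.T + t

class QueryFusion(nn.Module):
    """Toy fusion of the per-frame query embeddings (an assumed head)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, q_curr, q_prev):
        return self.mlp(torch.cat([q_curr, q_prev], dim=-1))
```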

4. What method is used (Introduction)

(Introduction) CAPE performs 3D position embedding in each camera's local coordinate system rather than in the 3D global space. As shown in Fig. 2(b), the method learns the view transformation from the 2D image to the local 3D space, which eliminates the variation in view transformation caused by different camera extrinsics.
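A minimal sketch of forming the PE in the local camera frame is given below: pixels are unprojected along depth bins with the inverse intrinsics only, so no extrinsic parameters enter the embedding and all cameras share the same pixel-to-embedding mapping. The MLP head and the flattening over depth bins are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CameraViewPE(nn.Module):
    """Position embedding from frustum points in the *local* camera frame.

    No extrinsic transform is applied, so the embedding of a pixel is
    shared across cameras and is unaffected by extrinsic changes.
    """
    def __init__(self, num_depth, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * num_depth, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, K_inv, H, W, depth_bins):
        v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                              torch.arange(W, dtype=torch.float32), indexing="ij")
        uv1 = torch.stack([u, v, torch.ones_like(u)], -1).reshape(-1, 1, 3)
        pts_cam = (uv1 * depth_bins.reshape(1, -1, 1)) @ K_inv.T  # camera frame
        return self.mlp(pts_cam.reshape(H * W, -1))  # (H*W, dim)
```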

(Introduction) How the problem is solved: given that the 3D PE lives in the local space while the output queries are defined in the global coordinate system, a bilateral attention mechanism is employed to avoid mixing embeddings from different representation spaces, as shown in Figure 1(b).
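A hedged sketch of such a bilateral (decoupled) cross-attention is shown below: content and positional similarities are computed in their own spaces and only their attention logits are added, so the global-space query embedding and the local-space PE are never summed into one mixed vector. Heads, masks, and the exact weighting are omitted; the paper's formulation may differ in detail.

```python
import torch
import torch.nn.functional as F

def bilateral_attention(q_c, q_p, k_c, k_p, v):
    """Decoupled single-head cross-attention (illustrative sketch).

    q_c, k_c: content query/key (global-space query vs. image features);
    q_p, k_p: positional query/key, with q_p already mapped into the
    camera's local frame so it is comparable with the local 3D PE in k_p.
    """
    d = q_c.shape[-1]
    logits = (q_c @ k_c.transpose(-1, -2)
              + q_p @ k_p.transpose(-1, -2)) / d ** 0.5
    return F.softmax(logits, dim=-1) @ v
```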

5. Conclusion

  • In this paper, we investigate the problem of 3D position embedding in sparse-query-based multi-view 3D object detection methods and propose a simple yet effective method called CAPE.

  • We form the 3D position embedding under the local camera view system instead of the global coordinate system, which greatly reduces the difficulty of view transformation learning.

  • Furthermore, we extend CAPE to temporal modeling by fusing the independent object queries of temporal frames. It achieves state-of-the-art performance even without LiDAR supervision, and provides new insights into position embedding in multi-view 3D object detection.

6. Limitations and future work

For temporal fusion over long frame horizons, the computation and storage costs become prohibitive.

In the future, we will explore more efficient spatial and temporal interaction of 2D and 3D features for autonomous driving systems.


Origin: blog.csdn.net/seasonsyy/article/details/131829687