PETR: Position Embedding Transformation for Multi-View 3D Object Detection
Affiliation

Megvii

Purpose

Problems with the 2D-to-3D feature sampling process in DETR3D:

  1. The predicted reference-point coordinates may be inaccurate, so the sampled image features may not correspond to the object.
  2. Only the image features at the projected reference-point locations are used, so the network cannot learn from the global representation.
  3. The feature sampling procedure is too complicated to apply in practice.

The goal of this paper is to propose a simple and elegant DETR-based framework for multi-view 3D object detection.

Contributions of this paper:

  1. A simple and elegant framework, PETR, is proposed for multi-view 3D object detection.
  2. A new 3D position-aware feature representation is proposed.
  3. State-of-the-art results are achieved on the nuScenes dataset.

Method

Network structure

Overall network structure:

  1. Images from the N views are fed to the backbone (ResNet-50) to extract 2D features (see the backbone sketch after this list).
  2. 3D coordinate generator: the camera frustum space is first discretized into a 3D meshgrid, then the grid coordinates are transformed with the camera parameters to produce coordinates in 3D world space.
  3. The 3D coordinates and 2D features are sent to the 3D position encoder to generate 3D position-aware features (one feature map per view).
  4. The 3D position-aware features are fed into the transformer decoder and interact with the object queries produced by the query generator.
  5. The updated object queries are used to predict object classes and 3D bounding boxes.
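
As a concrete illustration of step 1, here is a minimal sketch of running the per-view images through a shared 2D backbone. Assumptions: a torchvision ResNet-50 as the backbone and 6 surround-view cameras as on nuScenes; the input resolution and shapes are illustrative, not the paper's exact configuration.

```python
import torch
import torchvision

B, N, C_in, H, W = 1, 6, 3, 256, 704          # batch, views, channels, image size (illustrative)
images = torch.randn(B, N, C_in, H, W)

# Shared 2D backbone: ResNet-50 with the avgpool/fc head removed.
backbone = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

# Fold the view dimension into the batch dimension so all views share weights.
feats = backbone(images.flatten(0, 1))         # (B*N, 2048, H/32, W/32)
feats = feats.unflatten(0, (B, N))             # (B, N, 2048, 8, 22)
print(feats.shape)
```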

3D coordinate generator:

To establish the connection between the 2D images and the 3D space, points in the camera frustum space are projected into 3D space; the points in the two spaces are in one-to-one correspondence.
As in the DGSN paper, the camera frustum space is first discretized into a meshgrid (shape: W_F, H_F, D), and a transformation matrix built from the camera parameters converts the grid coordinates into 3D coordinates. The 3D world space is shared by all camera views.
The 3D coordinates are then normalized to the 3D region of interest, and the normalized coordinate tensor is permuted so that the D points per pixel lie along the channel dimension.
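
A minimal sketch of this step under a few assumptions: uniformly spaced depth bins, a per-view 4x4 matrix `frustum_to_world` that maps homogeneous frustum points (u*d, v*d, d, 1) into the shared world space, and a 3D region of interest used for normalization. All names and values here are illustrative, not the official implementation.

```python
import torch

def generate_3d_coords(H_F, W_F, D, frustum_to_world, roi_min, roi_max):
    """frustum_to_world: (N, 4, 4); returns normalized coords of shape (N, D, 3, H_F, W_F)."""
    u = torch.arange(W_F).float()
    v = torch.arange(H_F).float()
    d = torch.linspace(1.0, 60.0, D)                      # assumed depth range in meters
    dd, vv, uu = torch.meshgrid(d, v, u, indexing="ij")   # each (D, H_F, W_F)

    # Frustum points (u*d, v*d, d, 1) as in the meshgrid construction above.
    pts = torch.stack([uu * dd, vv * dd, dd, torch.ones_like(dd)], dim=-1)   # (D, H_F, W_F, 4)

    # Transform every frustum point into the shared 3D world space, per view.
    world = torch.einsum("nij,dhwj->ndhwi", frustum_to_world, pts)[..., :3]  # (N, D, H_F, W_F, 3)

    # Normalize into [0, 1] with the 3D region of interest.
    world = (world - roi_min) / (roi_max - roi_min)
    return world.permute(0, 1, 4, 2, 3)                   # (N, D, 3, H_F, W_F)

# Usage (ROI bounds are illustrative assumptions):
# coords = generate_3d_coords(8, 22, 64, torch.eye(4).repeat(6, 1, 1),
#                             torch.tensor([-61.2, -61.2, -10.0]),
#                             torch.tensor([61.2, 61.2, 10.0]))
```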

3D position encoder

The 3D position encoder combines the 2D image features with the 3D position information to produce 3D position-aware features.
Structure of the 3D position encoder:
The 2D features are reduced in dimension by a 1x1 convolution, the 3D coordinates are passed through an MLP to produce a 3D position embedding, and the two are added; the result is then flattened into the 3D position-aware features (one per view, shape N x (H_F x W_F) x C after flattening).
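
A minimal sketch of this encoder, assuming the MLP is implemented as two 1x1 convolutions, using illustrative channel sizes, and dropping the homogeneous coordinate so each pixel carries D x 3 normalized coordinates (matching the coordinate-generator sketch above).

```python
import torch
import torch.nn as nn

class PositionEncoder3D(nn.Module):
    def __init__(self, in_channels=2048, embed_dim=256, depth_bins=64):
        super().__init__()
        # 1x1 convolution for dimensionality reduction of the 2D features.
        self.feat_proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        # MLP over the (D * 3)-channel normalized 3D coordinates at each pixel.
        self.pos_mlp = nn.Sequential(
            nn.Conv2d(depth_bins * 3, embed_dim * 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim * 4, embed_dim, kernel_size=1),
        )

    def forward(self, feats_2d, coords_3d):
        """feats_2d: (N, C_in, H, W); coords_3d: (N, D, 3, H, W)."""
        n, d, _, h, w = coords_3d.shape
        pos = self.pos_mlp(coords_3d.reshape(n, d * 3, h, w))   # 3D position embedding
        x = self.feat_proj(feats_2d) + pos                      # 3D position-aware features
        return x.flatten(2).permute(0, 2, 1)                    # (N, H*W, C)

# Usage:
# enc = PositionEncoder3D()
# out = enc(torch.randn(6, 2048, 8, 22), torch.rand(6, 64, 3, 8, 22))   # (6, 176, 256)
```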

Query generator and Decoder

Query Generator:
(Generating the queries from anchor points, and predicting offsets relative to them, helps the network converge; anchoring the points in 3D world space is what guarantees convergence. The authors also tried the original DETR query setting and anchor points generated in BEV space, but neither guarantees convergence.)
First, a set of learnable anchor points in 3D world space is initialized from a uniform distribution between 0 and 1; these points are then fed to a two-layer MLP to generate the initial object queries.
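
A minimal sketch of the query generator as described: learnable anchor points initialized uniformly in [0, 1] and a two-layer MLP that maps them to the initial object queries. The query count and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryGenerator(nn.Module):
    def __init__(self, num_queries=900, embed_dim=256):
        super().__init__()
        # Learnable anchor points in the normalized 3D world space, uniform in [0, 1].
        self.anchor_points = nn.Parameter(torch.rand(num_queries, 3))
        # Two-layer MLP producing the initial object queries.
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self):
        return self.mlp(self.anchor_points)       # (num_queries, embed_dim)

# Usage: queries = QueryGenerator()()             # (900, 256)
```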


The decoder is the standard transformer decoder from DETR.
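
A minimal sketch of this stage, assuming PyTorch's nn.TransformerDecoder as a stand-in for the standard DETR decoder: the object queries cross-attend to the flattened 3D position-aware features from all views; shapes are illustrative.

```python
import torch
import torch.nn as nn

embed_dim, num_layers = 256, 6
layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

queries = torch.randn(1, 900, embed_dim)           # (batch, num_queries, C) from the query generator
memory = torch.randn(1, 6 * 8 * 22, embed_dim)     # 3D position-aware features from 6 views, flattened
updated = decoder(queries, memory)                 # (1, 900, 256), fed to the class/box prediction heads
```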

Source: blog.csdn.net/SugerOO/article/details/131741776