PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

Affiliation

Megvii

Purpose

The goal of this paper is to build a powerful and unified framework by extending PETR with temporal modeling and multi-task learning capabilities.

The main contributions of this paper are:

  1. Extend the 3D position embedding (3D PE) to temporal representation learning: temporal alignment is achieved by a pose transformation on the 3D PE. A feature-guided position encoder (FPE) is proposed, which reweights the 3D PE using 2D image features.
  2. A simple but effective scheme (introducing task-specific queries) is proposed to let PETR support multi-task learning, including BEV segmentation and 3D lane detection.
  3. The proposed framework achieves state-of-the-art performance on 3D object detection, BEV segmentation, and 3D lane detection.

Method

Network Structure

![[attachments/Pasted image 20230715171751.png]]

Temporal Modeling

3D Coordinate Alignment:
The purpose is to transform the 3D coordinates of frame t-1 into the 3D coordinate system of frame t.
For clarity of description, some symbols are defined here:

  • c(t): camera coordinate system at frame t
  • l(t): lidar coordinate system at frame t
  • e(t): ego coordinate system at frame t
  • g: global coordinate system
  • T_dst^src: transformation matrix from the source (src) coordinate system to the destination (dst) coordinate system

First, the 3D point sets defined in the camera coordinate systems of frame t-1 and frame t are projected into their respective lidar coordinate systems. Then, using the global coordinate system as a bridge, the point set in the lidar coordinate system of frame t-1 is transformed into the lidar coordinate system of frame t.
The aligned point sets of frame t and frame t-1 are then used to generate the 3D PE.
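As an illustration of the alignment, here is a minimal sketch in PyTorch (function and argument names are hypothetical, assuming 4x4 homogeneous transformation matrices) that chains the pose transforms so that 3D points from the lidar coordinate system of frame t-1 land in the lidar coordinate system of frame t:

```python
import torch

def transform_points(points_xyz, T):
    """Apply a 4x4 homogeneous transform T to an (N, 3) point set."""
    ones = torch.ones_like(points_xyz[:, :1])
    homo = torch.cat([points_xyz, ones], dim=-1)   # (N, 4)
    return (homo @ T.T)[:, :3]                     # (N, 3)

def align_prev_to_current(points_l_prev,
                          T_l_prev_to_e_prev,  # lidar(t-1) -> ego(t-1)
                          T_e_prev_to_g,       # ego(t-1)   -> global
                          T_g_to_e_t,          # global     -> ego(t)
                          T_e_t_to_l_t):       # ego(t)     -> lidar(t)
    """Use the global coordinate system as a bridge to map frame t-1 points
    into the lidar coordinate system of frame t."""
    T_l_prev_to_l_t = T_e_t_to_l_t @ T_g_to_e_t @ T_e_prev_to_g @ T_l_prev_to_e_prev
    return transform_points(points_l_prev, T_l_prev_to_l_t)
```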

Multi-task Learning

To allow PETR to support multi-task learning, task-specific queries are designed for BEV segmentation and 3D lane detection.

  • BEV segmentation

First, a set of anchor points is initialized in BEV space; these points are then fed into a two-layer MLP to generate the seg queries.
The final segmentation results are predicted using the same head as in CVT (see the sketch below).
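A minimal sketch of how the seg queries could be generated (PyTorch; the class and parameter names are hypothetical, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class SegQueryGenerator(nn.Module):
    """Fixed BEV anchor points -> two-layer MLP -> segmentation queries."""
    def __init__(self, points_per_axis=16, embed_dim=256):
        super().__init__()
        # Uniform grid of anchor points covering the BEV plane, normalized to [-1, 1].
        xs = torch.linspace(-1.0, 1.0, points_per_axis)
        ys = torch.linspace(-1.0, 1.0, points_per_axis)
        grid = torch.stack(torch.meshgrid(xs, ys, indexing="ij"), dim=-1)
        self.register_buffer("anchors", grid.reshape(-1, 2))   # (P*P, 2)
        self.mlp = nn.Sequential(
            nn.Linear(2, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self):
        # Each BEV anchor point becomes one seg query.
        return self.mlp(self.anchors)   # (num_queries, embed_dim)
```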

  • 3D Lane Detection

3D anchor lanes are used as queries.
Each lane is represented by an ordered set of n 3D points, uniformly sampled along the Y axis, with the anchor lanes parallel to the Y axis.
The 3D lane head predicts the category of each lane and the per-point offsets along the x and z axes. Since lane lines have varying lengths, a visibility vector T (of size n) is also predicted to control the starting point of the lane (see the sketch below).
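The lane head outputs can be sketched as follows (PyTorch; the dimensions and names are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class LaneHead(nn.Module):
    """Per lane query: class scores, x/z offsets for each of the n sampled
    points, and a visibility vector of size n."""
    def __init__(self, embed_dim=256, num_points=10, num_classes=4):
        super().__init__()
        self.cls_branch = nn.Linear(embed_dim, num_classes)
        self.reg_branch = nn.Linear(embed_dim, num_points * 2)  # x and z offsets
        self.vis_branch = nn.Linear(embed_dim, num_points)      # per-point visibility

    def forward(self, lane_queries):                       # (num_lanes, embed_dim)
        cls_logits = self.cls_branch(lane_queries)         # (num_lanes, num_classes)
        offsets = self.reg_branch(lane_queries)             # (num_lanes, n * 2)
        visibility = self.vis_branch(lane_queries).sigmoid() # (num_lanes, n)
        return cls_logits, offsets, visibility
```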

Feature-guided Position Encoder

In PETR, the encoding of 3D coordinates into the 3D PE is data-independent. This paper argues that the 3D PE should be guided by 2D features, since image features can provide useful cues such as depth information.
Therefore, in PETRv2 the 2D features are passed through two 1x1 convolution layers followed by a sigmoid to obtain attention weights, and the 3D coordinates are passed through a small MLP and multiplied by these attention weights to generate the 3D PE. The 3D PE is then added to the 2D features, and the result is fed into the transformer decoder as the key.
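A minimal sketch of the feature-guided position encoder (PyTorch; the 1x1 convolutions stand in for per-pixel MLPs, and the channel sizes are assumptions):

```python
import torch
import torch.nn as nn

class FeatureGuidedPE(nn.Module):
    """2D features -> two 1x1 convs -> sigmoid attention weights, which
    reweight the projected 3D coordinates to form the 3D PE; the 3D PE is
    then added to the 2D features and used as the decoder key."""
    def __init__(self, in_channels=256, embed_dim=256, coord_dim=192):
        super().__init__()
        # in_channels must equal embed_dim so the final addition is valid.
        self.attn = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.coord_mlp = nn.Sequential(
            nn.Conv2d(coord_dim, embed_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )

    def forward(self, feat_2d, coords_3d):
        # feat_2d: (B, C, H, W) image features; coords_3d: (B, coord_dim, H, W).
        weights = self.attn(feat_2d)               # data-dependent attention weights
        pe_3d = self.coord_mlp(coords_3d) * weights
        return pe_3d + feat_2d                     # key for the transformer decoder
```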

Robustness Analysis

Although there is a large body of work on autonomous driving systems, only a few works have explored the robustness of autonomous driving methods. This paper examines the influence of several sensor errors on the algorithm.

  • Extrinsic noise
    Extrinsic noise is very common; for example, camera shake can lead to inaccurate extrinsic parameters.
  • Camera loss
    One or more camera images may be missing.
  • Camera time delay
    The camera exposure time may be too long (e.g., at night), so the image fed to the system may be a previous frame, which affects the output.

Robustness Analysis Results

  1. Extrinsic noise: the larger the noise, the larger the performance drop; FPE improves robustness to extrinsic noise.
  2. Camera loss: losing the front camera (5.05% mAP drop) or the back camera (13.19% mAP drop) has the largest impact, while losing any other camera degrades performance less. The back camera has the largest field of view (120°), hence the largest impact (experiments on nuScenes).
  3. Camera time delay: unannotated (non-key) frames are used in place of keyframes to simulate time delay; a delay of 0.083 s causes drops of 3.19% mAP and 8.4% NDS, while a delay of 0.3 s causes drops of 26.08% mAP and 36.54% NDS.

Origin blog.csdn.net/SugerOO/article/details/131741856