【CMT】Cross Modal Transformer: Towards Fast and Robust 3D Object Detection


Paper link
Code link

1 Abstract

This paper proposes an end-to-end 3D object detector that converts both image and lidar features into tokens and feeds them to a transformer. The authors build a new spatial alignment method, based on position encoding, to align the image and lidar features, and achieve good results on nuScenes.

2 Introduction

Comparison with BEVFusion and TransFusion
(Figure: comparison of the BEVFusion, TransFusion, and CMT pipelines.)
(a) BEVFusion converts the image features into BEV features through the LSS structure and then simply concatenates them with the lidar BEV features.
(b) TransFusion first generates queries from high-response regions of the lidar features (regions where targets are likely to exist), and then runs a transformer with these queries over the point-cloud and image features.
(c) The proposed CMT first encodes the point-cloud and image positions with a PE structure, adds each position encoding to the features of the corresponding modality, and then lets queries carrying both position encodings interact with the point-cloud and image features simultaneously, which aligns the positions of the two modalities.

3 Method

(Figure: overall architecture of CMT.)

3.1 Extract features from different modalities and generate tokens

An image backbone and a lidar backbone extract features from the image and lidar inputs respectively, and each modality's feature map is turned into the corresponding tokens (the choice of backbone is flexible).
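As a rough illustration (not the authors' code), flattening a backbone feature map into a token sequence can look like the following sketch; the function name and shapes are assumptions.

```python
import torch

def to_tokens(feature_map: torch.Tensor) -> torch.Tensor:
    """Flatten a backbone feature map (B, C, H, W) into a token sequence
    (B, H*W, C) so image and point-cloud features can be consumed by the
    same transformer decoder."""
    B, C, H, W = feature_map.shape
    return feature_map.flatten(2).transpose(1, 2)  # (B, H*W, C)
```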

3.2 Coordinates Encoding Module (CEM)


(The author aligns the two modalities through the position encodings of the image and lidar features.)
First, for each modality's feature map F(u, v), a corresponding set of 3D points P(u, v) is constructed, where (u, v) are the feature-map coordinates (note: for image features (u, v) index height and width, while for lidar features they index the BEV grid). The CEM then encodes P(u, v):
$\Gamma(u, v) = \psi(P(u, v))$
ψ represents the MLP layer.
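A minimal PyTorch sketch of what the CEM ψ could look like: an MLP that maps the point set P(u, v) at each location to a positional embedding. The two-layer structure and dimensions are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CoordEncoder(nn.Module):
    """Sketch of the CEM: an MLP psi mapping a set of 3D points per
    feature-map location to a positional embedding."""
    def __init__(self, num_points: int, embed_dim: int = 256):
        super().__init__()
        # each location carries num_points 3D points -> flatten into one vector
        self.mlp = nn.Sequential(
            nn.Linear(num_points * 3, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, num_points, 3) -> (N, embed_dim)
        return self.mlp(points.flatten(1))
```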

Image position encoding

Inspired by PETR, the image point set P(u, v) is constructed in the same way as in PETR:
$p_k(u, v) = (u \cdot d_k,\; v \cdot d_k,\; d_k,\; 1)^T$
(u, v) are the coordinates on the image feature map and d_k is the k-th sampled depth along the depth axis; u·d_k and v·d_k implement the perspective mapping from image coordinates to 3D frustum coordinates. The camera intrinsic matrix K ∈ R^{4×4} and the transformation matrix T from the camera coordinate system to the lidar coordinate system are then used to transform P(u, v) into the lidar coordinate system:
$p_k^{l}(u, v) = T_c^{l}\, K^{-1}\, p_k(u, v)$
The CEM then encodes the transformed coordinates:
$\Gamma_{im}(u, v) = \psi_{im}\big(\{p_k^{l}(u, v)\}_k\big)$
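A hedged sketch of the frustum-point construction and camera-to-lidar transform described above; the argument names (`depths`, `intrinsic`, `cam2lidar`) and the 4×4 homogeneous-matrix convention are assumptions for illustration.

```python
import torch

def image_position_points(H, W, depths, intrinsic, cam2lidar):
    """Lift each image-feature location (u, v) to points (u*d_k, v*d_k, d_k, 1)
    along sampled depths d_k, then map them into the lidar frame."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    d = depths.view(-1, 1, 1)                                            # (D, 1, 1)
    ones = torch.ones_like(u).expand(d.shape[0], H, W)
    pts = torch.stack([u * d, v * d, d.expand(-1, H, W), ones], dim=-1)  # (D, H, W, 4)
    # back-project with K^{-1}, then apply the camera-to-lidar transform T
    pts_lidar = pts @ torch.linalg.inv(intrinsic).T @ cam2lidar.T
    return pts_lidar[..., :3]                                            # (D, H, W, 3)
```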

Lidar position encoding

The lidar position encoding is similar to the image position encoding but requires no coordinate-system conversion. The lidar point set P(u, v) is constructed as:
$p_k(u, v) = (u \cdot u_d,\; v \cdot v_d,\; h_k,\; 1)^T$
u_d and v_d are the sizes of a BEV grid cell and h_k is the sampled height of the k-th point; the encoding is:
$\Gamma_{pc}(u, v) = \psi_{pc}\big(\{p_k(u, v)\}_k\big)$
Finally, the generated image PE and point-cloud PE are added to the corresponding image tokens and point-cloud tokens, so that the tokens carry position information.
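A matching sketch for the BEV side, assuming a set of sampled heights h_k per cell (mirroring the sampled depths of the image branch); the grid sizes and the height set are illustrative parameters.

```python
import torch

def bev_position_points(H, W, grid_u, grid_v, heights):
    """Build the point set (u*u_d, v*v_d, h_k) for every BEV-feature cell."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    h = heights.view(-1, 1, 1)                     # (K, 1, 1) sampled heights h_k
    x = (u * grid_u).expand(h.shape[0], H, W)
    y = (v * grid_v).expand(h.shape[0], H, W)
    z = h.expand(-1, H, W)
    return torch.stack([x, y, z], dim=-1)          # (K, H, W, 3)
```

These point sets are what ψ_pc (and ψ_im on the image side) consume before the resulting embeddings are added to the tokens.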

3.3 Position-guided query generation

Following Anchor-DETR and PETR, n anchor points A = {a_i = (a_{x,i}, a_{y,i}, a_{z,i}), i = 1, 2, …, n} are first generated, with the coordinates of each point sampled uniformly from [0, 1]. The detection range [x_min, y_min, z_min, x_max, y_max, z_max] is then used to scale the anchor points to the detection range:
$a_i = a_i \odot (x_{max} - x_{min},\; y_{max} - y_{min},\; z_{max} - z_{min}) + (x_{min},\; y_{min},\; z_{min})$
The generated anchor points are then projected into the coordinates of each modality, and the CEM encodes the projected image/lidar points (using the same encoders as the position encoding above, so that the encodings within a modality stay consistent). The position encodings of the two modalities are added to obtain the positional embedding of the queries:
$\Gamma_q = \psi_{pc}(A_{pc}) + \psi_{im}(A_{im})$
A_pc and A_im are the projections of the anchor points A into the two modalities.
The positional embedding is added to the query content embedding to obtain the initial queries Q_0.
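A minimal sketch of the anchor sampling and rescaling step; `pc_range` follows the common [x_min, y_min, z_min, x_max, y_max, z_max] convention and the uniform sampling matches the description above.

```python
import torch

def init_query_anchors(num_queries, pc_range):
    """Sample anchors uniformly in [0, 1]^3, then scale them to the
    detection range to obtain 3D anchor points for the queries."""
    anchors = torch.rand(num_queries, 3)            # a_i ~ U[0, 1]^3
    low = torch.tensor(pc_range[:3])                # (x_min, y_min, z_min)
    high = torch.tensor(pc_range[3:])               # (x_max, y_max, z_max)
    return anchors * (high - low) + low             # anchors in lidar coordinates
```

These 3D anchors would then be projected into each modality, encoded with the same ψ_im / ψ_pc as above, and the two encodings summed to give the query positional embedding Γ_q.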

3.4 Decoder and loss

The decoder uses L decoding layers, and an FFN head then outputs the bounding box and class for each query:
$b_i = \Phi_{reg}(Q_i), \quad c_i = \Phi_{cls}(Q_i)$
Q_i denotes the updated queries from the i-th decoding layer.
Loss function: focal loss for classification and L1 loss for the bounding boxes.
$L = w_1 L_{cls} + w_2 L_{reg}$
w_1 and w_2 are two hyperparameters.
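A sketch of the weighted loss, using torchvision's sigmoid focal loss for classification and an L1 loss for the boxes; the default weights and the assumption that predictions are already matched to targets are placeholders, not the paper's exact setup.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_logits, cls_targets, box_preds, box_targets,
                   w1=1.0, w2=1.0):
    """Weighted sum of focal classification loss and L1 box regression loss.
    cls_targets are one-hot floats with the same shape as cls_logits."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    l_reg = F.l1_loss(box_preds, box_targets, reduction="mean")
    return w1 * l_cls + w2 * l_reg
```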

4 Masked-Modal Training for Robustness

In addition, the author randomly masks the lidar or image data during training to improve the robustness of the model when a modality is missing.
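A minimal sketch of how the per-batch modality masking could be scheduled; the masking probabilities are assumptions, not the paper's values.

```python
import random

def pick_training_modalities(p_mask_img=0.25, p_mask_pc=0.25):
    """Randomly drop one modality for the current batch so the model
    learns to cope with a missing sensor."""
    r = random.random()
    if r < p_mask_img:
        return {"use_image": False, "use_lidar": True}
    if r < p_mask_img + p_mask_pc:
        return {"use_image": True, "use_lidar": False}
    return {"use_image": True, "use_lidar": True}
```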

5 Experiments

(Experiment tables and figures: detection results on nuScenes.)


Origin blog.csdn.net/weixin_43915090/article/details/133761978