Paper translation: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries"

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

  The authors of this paper are from MIT, Toyota Research Institute, Carnegie Mellon University, and Li Auto; the paper was recently accepted to CoRL 2021.

  1. Paper link: https://export.arxiv.org/abs/2110.06922
  2. Official open source code: https://github.com/WangYueFt/detr3d

Abstract

  We introduce a framework for multi-camera 3D object detection. Existing methods generally estimate 3D bounding boxes directly from monocular images, or use depth prediction networks to generate inputs for 3D object detection from 2D information, whereas our method makes predictions directly in 3D space. Our model extracts 2D features from multiple camera images, then uses a sparse set of 3D object queries to index into these 2D features, using the camera transformation matrices to link 3D positions with the multi-view images. Finally, our model predicts a bounding box for each object query, using a set-to-set loss to measure the difference between the predictions and the ground truth. This top-down approach outperforms bottom-up approaches, whose box predictions depend on per-pixel depth estimates, because it is not subject to the compounding errors introduced by a depth prediction model. Furthermore, our method requires no post-processing such as NMS, greatly improving inference speed. We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.

1. Introduction

  3D object detection from visual information is a long-standing challenge for low-cost autonomous driving systems. While object detection from point clouds collected by sensors such as LiDAR benefits from the 3D structure of visible objects, the camera-based setting is more ill-posed, since we must generate 3D bounding-box predictions solely from the 2D information contained in RGB images.

  Existing methods typically build their detection pipelines entirely from 2D computations. That is, they use object detection pipelines designed for 2D tasks to predict 3D information such as object pose and velocity, without regard to 3D scene structure or sensor configuration. These methods require several post-processing steps to fuse predictions from different cameras and to remove redundant boxes, forcing a trade-off between performance and efficiency. As an alternative to these 2D-based methods, some approaches use 3D reconstruction to generate pseudo-LiDAR or range inputs from camera images, integrating more 3D computation into the object detection pipeline. They then treat these inputs as if they were collected directly from 3D sensors and apply a 3D object detection method. But this strategy compounds errors: poorly predicted depth values have a strong negative effect on 3D object detection performance.

  In this paper, we propose a more elegant transformation from 2D observations to 3D predictions for autonomous driving that does not rely on a dense depth prediction module. Our method, DETR3D (Multi-view 3D Detection), addresses this problem in a top-down manner. We leverage the camera transformation matrices and geometric back-projection to link the extracted 2D features with 3D object detection. Our method starts from a sparse set of object priors that are shared across the dataset and learned end-to-end. To gather scene-specific information, we project a set of reference points decoded from the object priors onto each camera and collect the corresponding image features extracted by a ResNet backbone. The features gathered at the reference points then interact with each other through multi-head self-attention layers. After a sequence of self-attention layers, we read out bounding-box parameters from each layer and evaluate the predictions with a set-to-set loss inspired by DETR.
  Our architecture uses neither point cloud reconstruction nor explicit depth estimation, which makes it robust to depth errors. Moreover, our method requires no post-processing such as NMS, improving efficiency and reducing reliance on hand-designed components. Without NMS, our method performs comparably on the nuScenes dataset to prior methods that use NMS. In regions where camera views overlap, our method significantly outperforms other methods.

Contributions
  Our main contributions are summarized as follows:

  • We propose an improved model for 3D object detection from RGB images. Unlike existing works that fuse the object predictions of different camera images at the final stage, our method fuses information from all cameras at every layer of computation. To the best of our knowledge, this is the first attempt to cast multi-camera detection as a 3D set-to-set prediction problem.
  • We propose a module that connects the extracted 2D features with 3D bounding-box predictions via geometric back-projection. It is immune to the inaccurate depth predictions of two-stage networks and seamlessly uses information from multiple cameras by back-projecting 3D information onto all available frames.
  • Similar to Object DGCNN, our method requires no post-processing such as per-image or global NMS, and it performs comparably to existing NMS-based methods. In the overlapping regions of camera views, our method significantly outperforms other methods.
  • We release the code to facilitate reproducibility and future research.

2. Related Work

2D object detection
  RCNN was the first to use deep learning for object detection. It feeds a set of pre-selected object proposals into a CNN and predicts bounding-box parameters for each. Although this method shows impressive performance, it is an order of magnitude slower than other methods because it requires a CNN forward pass for every object proposal. To solve this problem, Fast RCNN shares computation by processing the entire image in a single CNN forward pass. To further improve accuracy and speed, Faster RCNN introduces an RPN that shares full-image convolutional features with the detection network, making region proposals nearly cost-free. Mask RCNN adds a mask prediction branch for simultaneous instance segmentation. These methods usually involve multi-stage optimization, which can become slow in practice. Unlike these multi-stage methods, SSD and YOLO make a single dense prediction. Although they are much faster, they still rely on NMS to filter redundant box predictions, and they predict boxes relative to predefined anchors. CenterNet and FCOS change this paradigm, greatly simplifying the general object detection pipeline by turning per-anchor prediction into per-pixel prediction.

Set-based object detection
  DETR casts object detection as a set-to-set problem, using a Transformer to capture the relationships between features and objects. DETR learns to assign predictions to the set of ground-truth boxes, so it needs no post-processing to filter redundant boxes. However, DETR has an important drawback: it requires very long training. Deformable DETR analyzes the slow convergence of DETR and proposes a deformable self-attention module to focus on relevant features and speed up training. Meanwhile, other researchers attribute the slow convergence of DETR to its set-based loss and the Transformer's cross-attention mechanism; they propose two variants, TSP-FCOS and TSP-RCNN, to overcome these problems. SparseRCNN introduces set prediction into an RCNN-style pipeline, outperforming multi-stage object detectors without NMS. OneNet finds an interesting phenomenon: dense object detection methods no longer need NMS once they are equipped with a minimum-cost set loss. In the 3D domain, Object DGCNN studies 3D object detection from point clouds, modeling 3D detection as message passing on a dynamic graph and generalizing the DGCNN framework to predict a set of objects. Like DETR, Object DGCNN requires no NMS.

Monocular 3D Object Detection
  An early approach to 3D detection from RGB images is Mono3D, which uses semantic and shape cues to select from a set of 3D proposals and exploits scene constraints and additional prior information during training. Roddick et al. use a BEV representation for 3D detection, and Mousavian et al. start from 2D detections and regress 3D bounding boxes by minimizing the 2D-3D projection error. Using 2D detectors as the starting point for 3D reasoning has become a standard approach. Other works explore differentiable rendering and 3D keypoint detection to achieve state-of-the-art 3D object detection performance. All of these methods are monocular; scaling to multiple cameras requires processing each frame individually and fusing the model outputs in a post-processing stage.

3. Multi-view 3D Object Detection

3.1 Overview

  Our architecture takes as input a set of RGB images from cameras whose projection matrices are known, and outputs a set of 3D bounding-box parameters for the objects in the scene. Compared to past approaches, we design our architecture around a few high-level requirements:

  • We incorporate 3D information into intermediate computations in our architecture, rather than performing pure 2D computations in the image plane.
  • We do not estimate dense 3D scene geometry, thus avoiding associated reconstruction errors.
  • We avoid post-processing steps such as NMS.

  We address these requirements with a novel set prediction module that connects 2D feature extraction and 3D box prediction by alternating between 2D and 3D computations. As shown in Figure 1, our model consists of three key components. First, following common practice in 2D vision, features are extracted from the camera images by a shared ResNet backbone, optionally augmented with a feature pyramid network (FPN). Second, a detection head, our main contribution, connects the computed 2D features with a set of 3D bounding-box predictions in a geometry-aware manner. Each layer of the detection head starts from a sparse set of object queries learned from data. Each object query encodes a 3D location, which is projected onto the camera planes and used to collect image features via bilinear interpolation. As in DETR, we then use multi-head attention to refine the object queries by incorporating object interactions. This layer is repeated multiple times, alternating between feature sampling and query refinement. Finally, we train the network with a set-to-set loss.

3.2 Feature Learning

  The inputs to our model are: a set of images $\mathcal{I}=\{\mathrm{im}_1,\dots,\mathrm{im}_K\}\subset\mathbb{R}^{H_{\mathrm{im}}\times W_{\mathrm{im}}\times 3}$ collected by the surround-view cameras, camera matrices $\mathcal{T}=\{T_1,\dots,T_K\}\subset\mathbb{R}^{3\times 4}$, ground-truth bounding boxes $\mathcal{B}=\{b_1,\dots,b_j,\dots,b_M\}\subset\mathbb{R}^9$, and class labels $\mathcal{C}=\{c_1,\dots,c_j,\dots,c_M\}\subset\mathbb{Z}$. Each $b_j$ contains position, size, heading, and velocity in BEV; our model predicts these bounding boxes and their labels from the images alone. We do not use point clouds captured by high-end LiDAR.
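  For concreteness, here is a minimal sketch of how one training sample could be laid out under these definitions; the container type and the ordering of the 9 box parameters (center, size, yaw, BEV velocity) are illustrative assumptions, not the format used by the official code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    images: np.ndarray      # (K, H_im, W_im, 3) surround-view RGB images, K = 6
    cam_mats: np.ndarray    # (K, 3, 4) camera projection matrices T_1..T_K
    gt_boxes: np.ndarray    # (M, 9) assumed layout: x, y, z, w, l, h, yaw, vx, vy
    gt_labels: np.ndarray   # (M,) integer class labels

# Empty placeholder sample at nuScenes camera resolution (900 x 1600).
sample = Sample(
    images=np.zeros((6, 900, 1600, 3), dtype=np.uint8),
    cam_mats=np.zeros((6, 3, 4), dtype=np.float32),
    gt_boxes=np.zeros((0, 9), dtype=np.float32),
    gt_labels=np.zeros((0,), dtype=np.int64),
)
```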

  These images are encoded by ResNet and FPN into four sets of features $\mathcal{F}_1,\mathcal{F}_2,\mathcal{F}_3,\mathcal{F}_4$. Each set $\mathcal{F}_k=\{f_{k1},\dots,f_{k6}\}\subset\mathbb{R}^{H\times W\times C}$ corresponds to one feature level of the 6 images. These multi-scale features provide rich information for recognizing objects of different sizes. Next, we detail how we transform these 2D features into 3D with a set prediction module.
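  As a rough sketch of this stage, the snippet below runs a torchvision ResNet+FPN backbone over the 6 camera images of one sample and keeps four pyramid levels. The use of `resnet_fpn_backbone` (torchvision ≥ 0.13) and the input resolution are illustrative stand-ins for the paper's actual backbone setup, and the pyramid strides differ from the 1/8–1/64 maps described in Section 4.1.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Illustrative stand-in for the paper's ResNet + FPN feature extractor.
# K = 6 surround-view cameras, each image encoded into 4 feature levels.
backbone = resnet_fpn_backbone(backbone_name="resnet101", weights=None, trainable_layers=5)

K, H_im, W_im = 6, 448, 800              # arbitrary example resolution
images = torch.randn(K, 3, H_im, W_im)   # one sample = 6 camera images

with torch.no_grad():
    fpn_out = backbone(images)           # OrderedDict of pyramid levels

# Keep four levels F_1..F_4; each F_k holds one 256-channel map per camera.
levels = list(fpn_out.values())[:4]
for k, feat in enumerate(levels, start=1):
    print(f"F_{k}: {tuple(feat.shape)}")  # (K, 256, H_k, W_k)
```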

3.3 Detection Head

  Existing object detection methods usually follow a bottom-up approach: predict a dense set of bounding boxes per image, filter redundant boxes within each image, and then fuse the predictions from different cameras in a post-processing step. This paradigm has two major drawbacks: dense box prediction requires accurate depth estimates, and depth prediction is itself a challenging problem; moreover, NMS-based redundancy removal and fusion are non-parallelizable operations that introduce significant inference overhead. We address these issues with the top-down detection head described below.

  Similar to Object DGCNN and Deformable DETR, DETR3D is iterative: it uses $L$ layers of set-based computation to produce bounding-box estimates from the 2D feature maps. Each layer consists of the following steps:

  1. predict a set of bounding-box centers from the object queries;
  2. project these centers onto all feature maps using the camera transformation matrices;
  3. sample features via bilinear interpolation and incorporate them into the object queries;
  4. model object interactions using multi-head attention.

  Inspired by DETR, each layer $\ell\in\{0,\dots,L-1\}$ operates on a set of object queries $\mathscr{Q}_\ell=\{q_{\ell 1},\dots,q_{\ell M^*}\}\subset\mathbb{R}^C$ and produces a new set $\mathscr{Q}_{\ell+1}$. A reference point $c_{\ell i}\in\mathbb{R}^3$ is decoded from the object query $q_{\ell i}$:
$$c_{\ell i}=\Phi^{\mathrm{ref}}(q_{\ell i}) \tag{1}$$
where $\Phi^{\mathrm{ref}}$ is a neural network. $c_{\ell i}$ can be regarded as a hypothesis for the center of the $i$-th bounding box. Next, we gather image features corresponding to $c_{\ell i}$ in order to refine it and predict the final bounding box. Using the camera transformation matrices, $c_{\ell i}$ (or, more precisely, its homogeneous counterpart $c_{\ell i}^*$) is projected into each image:
$$c_{\ell i}^*=c_{\ell i}\oplus 1, \qquad c_{\ell m i}=T_m c_{\ell i}^* \tag{2}$$
where $\oplus$ denotes concatenation and $c_{\ell m i}$ is the projection of the reference point onto the $m$-th camera. To remove the effect of the feature-map sizes and to gather features across different levels, we normalize $c_{\ell m i}$ to $[-1,1]$. The image features are then sampled:
$$f_{\ell k m i}=f^{\mathrm{bilinear}}(\mathcal{F}_{km}, c_{\ell m i}) \tag{3}$$
where $f_{\ell k m i}$ is the feature of the $i$-th point at level $k$ of the $m$-th camera in layer $\ell$.

  A given reference point is not necessarily visible in every camera image, so we need a heuristic to filter out invalid points. To this end, we define a binary value $\sigma_{\ell k m i}$ indicating whether the reference point projects outside the image plane. The final feature $f_{\ell i}$ and the object query of the next layer, $q_{(\ell+1)i}$, are given by:
$$f_{\ell i}=\frac{1}{\sum_k\sum_m \sigma_{\ell k m i}+\epsilon}\sum_k\sum_m f_{\ell k m i}\,\sigma_{\ell k m i} \quad\text{and}\quad q_{(\ell+1)i}=f_{\ell i}+q_{\ell i} \tag{4}$$
where $\epsilon$ is a small value that prevents division by zero. Finally, for each object query $q_{\ell i}$, neural networks $\Phi_\ell^{\mathrm{reg}}$ and $\Phi_\ell^{\mathrm{cls}}$ predict a bounding box $\hat{b}_{\ell i}$ and its class label $\hat{c}_{\ell i}$:
$$\hat{b}_{\ell i}=\Phi_\ell^{\mathrm{reg}}(q_{\ell i}) \quad\text{and}\quad \hat{c}_{\ell i}=\Phi_\ell^{\mathrm{cls}}(q_{\ell i}) \tag{5}$$
  We compute the loss on the predictions $\hat{\mathcal{B}}_\ell=\{\hat{b}_{\ell 1},\dots,\hat{b}_{\ell j},\dots,\hat{b}_{\ell M^*}\}\subset\mathbb{R}^9$ and $\hat{\mathcal{C}}_\ell=\{\hat{c}_{\ell 1},\dots,\hat{c}_{\ell j},\dots,\hat{c}_{\ell M^*}\}\subset\mathbb{Z}$. At inference time, we only use the output of the last layer.
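  To make the above computation concrete, here is a minimal, self-contained sketch of a single DETR3D layer implementing Eqs. (1)–(5). It assumes one feature level per camera and a batch size of one, and the module names, tensor layouts, and class count are illustrative assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one DETR3D layer (Eqs. 1-5). Assumes a single feature
# level per camera and batch size 1; names and layouts are illustrative.
C, M_star, K_cams = 256, 900, 6

phi_ref = nn.Linear(C, 3)                                     # Eq. (1): query -> 3D reference point
self_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
phi_reg = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, 9))    # box parameters, Eq. (5)
phi_cls = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, 10))   # class logits, Eq. (5)

def detr3d_layer(queries, feats, cam_mats, eps=1e-5):
    # queries: (M*, C); feats: (K, C, H, W); cam_mats: (K, 3, 4) projection matrices
    centers = phi_ref(queries)                                # (M*, 3)
    centers_h = F.pad(centers, (0, 1), value=1.0)             # homogeneous copy c* = (c, 1), Eq. (2)
    proj = torch.einsum("kij,mj->kmi", cam_mats, centers_h)   # (K, M*, 3) image-plane coordinates
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=eps)        # perspective divide
    valid = proj[..., 2] > 0                                  # point must lie in front of the camera

    H, W = feats.shape[-2:]
    grid = torch.stack([uv[..., 0] / W * 2 - 1,               # normalize to [-1, 1]
                        uv[..., 1] / H * 2 - 1], dim=-1)
    valid = valid & (grid.abs() <= 1).all(dim=-1)             # sigma: projection inside the image

    # Eq. (3): bilinear sampling of image features at the projected reference points.
    sampled = F.grid_sample(feats, grid.unsqueeze(2), align_corners=False)  # (K, C, M*, 1)
    sampled = sampled.squeeze(-1).permute(0, 2, 1)            # (K, M*, C)

    # Eq. (4): average over visible cameras, then residual update of the queries.
    sigma = valid.unsqueeze(-1).float()
    f = (sampled * sigma).sum(dim=0) / (sigma.sum(dim=0) + eps)
    queries = f + queries

    q = queries.unsqueeze(0)                                  # object interactions via self-attention
    queries = self_attn(q, q, q)[0].squeeze(0) + queries

    return queries, phi_reg(queries), phi_cls(queries)        # refined queries, boxes, class logits

queries = torch.randn(M_star, C)
feats = torch.randn(K_cams, C, 32, 88)
cam_mats = torch.randn(K_cams, 3, 4)
queries, boxes, logits = detr3d_layer(queries, feats, cam_mats)
```

  In the full model this layer is stacked $L$ times, with each layer's refined queries feeding the next and each layer's box and class outputs supervised by the set-to-set loss described in Section 3.4.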

3.4 Loss

  We use a set-to-set loss to measure the difference between the set of predictions $(\hat{\mathcal{B}}_\ell,\hat{\mathcal{C}}_\ell)$ and the ground-truth set $(\mathcal{B},\mathcal{C})$. The loss has two parts: a focal loss on the class labels and an $L^1$ loss on the bounding-box parameters. For ease of notation, we omit the subscript $\ell$ from $\hat{\mathcal{B}}_\ell,\hat{\mathcal{C}}_\ell$. The number of ground-truth boxes $M$ is usually smaller than the number of predictions $M^*$; for convenience, we pad the set of ground-truth boxes with $\varnothing$ (no object) up to size $M^*$. We associate ground truth and predictions through a bipartite matching problem:
$$\sigma^*=\mathrm{argmin}_{\sigma\in\mathcal{P}}\sum_{j=1}^{M}-1_{\{c_j\neq\varnothing\}}\hat p_{\sigma(j)}(c_j)+1_{\{c_j\neq\varnothing\}}\mathcal{L}_{\mathrm{box}}(b_j,\hat b_{\sigma(j)})$$
where $\mathcal{P}$ denotes the set of permutations, $\hat p_{\sigma(j)}(c_j)$ is the probability of class $c_j$ for the prediction with index $\sigma(j)$, and $\mathcal{L}_{\mathrm{box}}$ is the $L^1$ loss on the bounding-box parameters. We solve this assignment problem with the Hungarian algorithm, yielding the set-to-set loss
$$\mathcal{L}_{\mathrm{sup}}=\sum_{j=1}^{M^*}-\log\hat p_{\sigma^*(j)}(c_j)+1_{\{c_j\neq\varnothing\}}\mathcal{L}_{\mathrm{box}}(b_j,\hat b_{\sigma^*(j)})$$
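  Below is a minimal sketch of the bipartite matching step, assuming unit weights on the classification and L1 box terms; the focal-loss formulation and the cost weighting used in practice are omitted, and the helper name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Hungarian matching between M* predictions and M ground-truth boxes.

    pred_probs: (M*, num_classes) class probabilities
    pred_boxes: (M*, 9) box parameters
    gt_labels:  (M,) integer class labels
    gt_boxes:   (M, 9) box parameters
    Returns (pred_idx, gt_idx) index arrays of the optimal assignment.
    """
    # Cost = -p_hat_{sigma(j)}(c_j) + L1 box distance (unit weights for brevity).
    cls_cost = -pred_probs[:, gt_labels]                                     # (M*, M)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1) # (M*, M)
    cost = cls_cost + box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)                           # Hungarian algorithm
    return pred_idx, gt_idx

# Toy example: 5 predictions, 2 ground-truth boxes.
rng = np.random.default_rng(0)
probs = rng.random((5, 10)); probs /= probs.sum(-1, keepdims=True)
pred_boxes = rng.random((5, 9))
gt_labels = np.array([3, 7])
gt_boxes = rng.random((2, 9))
pred_idx, gt_idx = match_predictions(probs, pred_boxes, gt_labels, gt_boxes)
# Matched predictions receive classification + box losses; all others are
# supervised as "no object".
```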

4. Experiment

  First, Section 4.1 describes the dataset, metrics, and implementation details; Section 4.2 compares our method with existing methods; Section 4.3 benchmarks the performance of different models in camera-overlap regions; Section 4.4 compares with pseudo-LiDAR methods; and Section 4.5 provides additional analysis and ablation studies.

4.1 Implementation Details

Dataset
  We evaluate our method on the nuScenes dataset. nuScenes contains 1000 video sequences, each about 20 s long and sampled at 20 frames per second. Each frame contains 6 camera images [front_left, front, front_right, back_left, back, back_right], and the camera intrinsics and extrinsics are provided. nuScenes is annotated every 0.5 s, with 28k, 6k, and 6k annotated samples in the training, validation, and test sets, respectively. There are 23 categories in total, 10 of which are used to compute the metrics.

Metrics
  We follow the official nuScenes evaluation protocol. We report average translation error (ATE), average scale error (ASE), average orientation error (AOE), average velocity error (AVE), and average attribute error (AAE). These are true-positive (TP) metrics computed in physical units. In addition, we report mAP. To evaluate all aspects of the detection task, nuScenes defines a composite metric, the nuScenes Detection Score (NDS):
$$\mathrm{NDS}=\frac{1}{10}\Big[5\,\mathrm{mAP}+\sum_{\mathrm{mTP}\in\mathbb{TP}}\big(1-\min(1,\mathrm{mTP})\big)\Big]$$
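  For concreteness, the NDS formula can be sketched as follows; the metric values in the example are illustrative placeholders, not results from the paper, and the official toolkit's additional bookkeeping is not reproduced.

```python
def nds(map_score, tp_errors):
    """nuScenes Detection Score from mAP and the five TP error metrics.

    tp_errors: dict with keys mATE, mASE, mAOE, mAVE, mAAE (lower is better).
    """
    tp_terms = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return (5.0 * map_score + tp_terms) / 10.0

# Illustrative values only (not results reported in the paper):
print(nds(0.35, {"mATE": 0.77, "mASE": 0.27, "mAOE": 0.38, "mAVE": 0.84, "mAAE": 0.19}))
```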

Model
  Our model consists of a ResNet feature extractor, an FPN, and a DETR3D detection head. We use deformable convolutions in the third and fourth stages of ResNet101. The FPN takes the ResNet output features and produces feature maps at 1/8, 1/16, 1/32, and 1/64 resolution. The DETR3D head consists of 6 layers, each comprising a feature sampling step and a multi-head attention layer. The hidden dimension of the detection head is 256. Finally, two sub-networks predict bounding-box parameters and a class label for each object query; each sub-network consists of two 256-dimensional fully connected layers. We use LayerNorm in the detection head.

Training & inference
  We use AdamW throughout training, with weight_decay $=10^{-1}$ and an initial learning rate of $10^{-4}$, which is decayed to $10^{-5}$ and $10^{-6}$ at the 8th and 11th epochs, respectively. The model is trained for 12 epochs on 8 RTX 3090 GPUs with batch_size $=1$ per GPU; training takes about 18 hours. We do not use any post-processing such as NMS at inference time, and we use the official nuScenes evaluation toolkit for evaluation.
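  A minimal sketch of this optimization schedule with standard PyTorch components is shown below; `model` is a stand-in module, and approximating the step decay with MultiStepLR is an assumption about how the schedule described above could be realized.

```python
import torch
from torch import nn, optim

model = nn.Linear(256, 9)   # stand-in for the DETR3D network

# AdamW with weight decay 1e-1 and initial learning rate 1e-4.
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-1)

# Decay the learning rate by 10x at epochs 8 and 11 (1e-4 -> 1e-5 -> 1e-6).
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one training epoch over the nuScenes training set ...
    optimizer.step()        # placeholder for the per-iteration updates
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```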

4.2 Comparison to Existing Works

  We compare with the previous state-of-the-art methods CenterNet and FCOS3D. CenterNet is an anchor-free 2D detection method that makes dense predictions on a high-resolution feature map; FCOS3D makes per-pixel predictions with the FCOS pipeline. Both convert 3D object detection into a 2D problem, thereby ignoring scene geometry and sensor configuration. To detect objects across multiple views, these methods process each camera image separately and require global NMS to remove redundant boxes within each view and in the overlapping regions between cameras. As shown in Table 1, our method outperforms these methods without using any post-processing. Our method is worse than FCOS3D on the mATE metric, presumably because FCOS3D directly predicts the depth of each bounding box, which provides strong supervision for object location. In addition, FCOS3D uses separate heads for different bounding-box parameters, which may improve performance.
  On the test set (Table 2), our method outperforms all existing methods; for a fair comparison, we use the same backbone as DD3D.

4.3 Comparison in Overlap Regions

  Prediction is harder in regions where camera views overlap, because objects there are more likely to be truncated. Our method considers all cameras simultaneously, whereas FCOS3D predicts boxes for each camera individually. To further demonstrate the advantage of joint inference across cameras, we separately evaluate performance on bounding boxes that fall in the camera-overlap regions. To compute this metric, we select the boxes whose 3D centers are visible from multiple cameras. On the validation set there are 18147 such boxes, 9.7% of the total. Table 3 presents the results; our method outperforms FCOS3D by a large margin in NDS. This confirms that fusing information across cameras is more effective.

4.4 Comparison to pseudo-LiDAR Methods

  Another approach to 3D object detection is to use a depth prediction model to generate a pseudo-LiDAR point cloud from the surrounding images. On the nuScenes dataset there are currently no published pseudo-LiDAR results available for direct comparison, so we implement the baseline ourselves to verify that our method is more effective than explicit depth prediction. We use a pretrained PackNet to predict depth maps for the 6 camera images and then use the camera parameters to convert the depth maps into point clouds. We also experimented with self-supervised PackNet using velocity supervision, but we found that ground-truth depth supervision produces more realistic point clouds, so we use the supervised model as the baseline. For 3D detection we use the recently proposed CenterPoint architecture. Conceptually, this pipeline is a variant of pseudo-LiDAR. Table 4 presents the results: this pseudo-LiDAR approach is far inferior to our method despite using current state-of-the-art components. A possible explanation is that pseudo-LiDAR detectors suffer from the compounding errors of inaccurate depth predictions, which overfit the training data and generalize poorly to other distributions.

4.5 Ablation & Analysis

  We visualize the refinement of object queries in Figure 2, showing the bounding boxes decoded from the queries at each layer. The deeper the layer, the closer the predicted boxes are to the ground truth. Likewise, the leftmost plot shows that the learned object-query prior is shared across all data. We also quantify this in Table 5, which shows that iterative refinement significantly improves performance. In addition, Table 6 provides an ablation over the number of object queries; performance tends to saturate around 900 queries. Finally, Table 7 compares different backbones.

  We also provide qualitative results in Figure 3 to give a more intuitive picture of the model's behavior. We project the predicted bounding boxes onto the 6 camera images and a BEV (top-down) view. In general, our method produces reasonable results and even detects relatively small objects in the distance. However, our method still exhibits large location errors (consistent with the tabular results in Section 4.2): although our method avoids explicit depth prediction, depth estimation remains a central challenge in this problem.

5. Conclusion

  We propose a new paradigm for the ill-posed inverse problem of recovering 3D information from 2D images. Without priors learned from the data, the input signal lacks the information the model needs to make effective predictions. Whereas other methods either operate purely on 2D computations or use an additional depth network to reconstruct the scene, our method operates in 3D space and uses back-projection to retrieve image features as needed. The benefits of our approach are twofold: (1) it reduces the need for intermediate representations (such as predicted depth maps or point clouds), which can be a source of compounding errors; and (2) it uses information from multiple cameras by projecting the same 3D point onto all available frames.

  Beyond directly applying our work to 3D object detection for autonomous driving, several directions deserve future exploration. For example, a single projected point yields a limited receptive field in the retrieved image feature maps; sampling multiple points per object query would incorporate more information for refinement. In addition, the new detection head is input-agnostic, so including other modalities such as LiDAR/RADAR could improve performance and robustness. Finally, generalizing our pipeline to other domains, such as indoor navigation and object manipulation, would broaden its range of applications and reveal further avenues for improvement.
