SparseBEV: A high-performance, fully sparse, vision-only 3D object detector

Author: Wang Limin | Source: CVHub 

Overview

This article introduces our new work in 3D object detection: SparseBEV. The 3D world we live in is sparse, so sparse 3D object detection is an important direction. However, there is still a performance gap between existing sparse 3D detectors (such as DETR3D [1] and PETR [2]) and dense detectors (such as BEVFormer [3] and BEVDet [8]). In response, we argue that the detector's adaptability in both BEV space and 2D image space should be enhanced, and we propose SparseBEV, a high-performance, fully sparse model. On the nuScenes validation set, SparseBEV reaches 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. On the nuScenes test set, SparseBEV achieves 67.5 NDS using only the lightweight V2-99 backbone. With a ViT-Large backbone, as used by methods such as HoP [5] and StreamPETR-large [6], reaching 70+ should be straightforward.

Our work has been accepted by ICCV 2023, and the paper, code and weights (including our model with 67.5 NDS on the list) have been made public:

Paper : https://arxiv.org/abs/2308.09244
Code : https://github.com/MCG-NJU/SparseBEV

Introduction

Existing 3D object detection methods can be divided into two categories: those based on dense BEV features and those based on sparse queries. The former requires constructing dense BEV-space features; although the performance is strong, the computation is heavy. Sparse-query-based methods avoid this step and are simpler and faster, but their performance still lags behind BEV-based methods. A natural question therefore arises: can a sparse-query-based method reach performance that is close to, or even better than, dense-BEV-based methods?

Based on our experimental analysis, we believe the key to this goal is to improve the detector's adaptability in both BEV space and 2D image space. This adaptability is query-specific: for different queries, the detector must be able to encode and decode features in different ways. This is exactly what the previous fully sparse 3D detector DETR3D lacks. We therefore propose SparseBEV, which makes three main improvements. First, a scale-adaptive self-attention (SASA) module provides an adaptive receptive field in BEV space. Second, an adaptive spatio-temporal sampling module makes sparse sampling adaptive and fully exploits long temporal sequences. Finally, adaptive mixing is used to decode the sampled features in a query-dependent way.

As early as February 9 this year, before the ICCV deadline, SparseBEV (with a V2-99 backbone) had already reached 65.6 NDS on the nuScenes test set, surpassing methods such as BEVFormerV2 [7]. As shown in the figure below, that entry is named SparseBEV-Beta; see the eval.ai leaderboard for details.

[Figure: nuScenes test leaderboard entry for SparseBEV-Beta]

Recently, we adopted some of the latest settings from StreamPETR, including setting the X and Y weights of the bbox loss to 2.0 and using query denoising to stabilize training. With these changes, SparseBEV with only the lightweight V2-99 backbone reaches 67.5 NDS on the test set, ranking fourth on the vision-only 3D detection leaderboard (the top three all use the heavyweight ViT-Large backbone):

[Figure: nuScenes test leaderboard with SparseBEV at 67.5 NDS]

Under the small-scale validation-set setting (ResNet-50, 704x256), SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS, taking full advantage of its sparse design.

[Figure: accuracy vs. inference speed of SparseBEV under the ResNet-50, 704x256 setting]

method

Model architecture

[Figure: overall architecture of SparseBEV]

The model architecture of SparseBEV is shown in the figure above. Its core modules are scale-adaptive self-attention, adaptive spatio-temporal sampling, and adaptive mixing.

Query Initialization

Existing query-based methods use reference points as queries. In SparseBEV, a query carries richer information: 3D coordinates, size, rotation angle, velocity, and an associated query feature. Each query is initialized as a pillar, with its center height z set to 0 and its height set to approximately 4 m, because objects rarely stack on top of each other along the Z-axis in autonomous driving scenes.
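As a concrete illustration, here is a minimal sketch of such a pillar-style query initialization in PyTorch. The field layout (x, y, z, w, l, h, yaw, vx, vy), the tensor names, and the uniform placement over the BEV plane are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

num_queries, embed_dim = 900, 256

# Each query: a 3D box (x, y, z, w, l, h, yaw, vx, vy) plus a learnable feature vector.
bbox = torch.zeros(num_queries, 9)
bbox[:, :2].uniform_(0.0, 1.0)   # spread centers over the (normalized) BEV plane
bbox[:, 2] = 0.0                 # pillar center height z = 0
bbox[:, 5] = 4.0                 # pillar height of roughly 4 m (the real code may encode sizes differently)

query_bbox = nn.Parameter(bbox)                                  # box parameters, refined layer by layer
query_feat = nn.Parameter(torch.zeros(num_queries, embed_dim))   # query features
```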

Scale-adaptive Self Attention

Multi-scale feature extraction in BEV space is important. Dense-BEV-based methods usually aggregate multi-scale features explicitly through a BEV encoder (for example, BEVDet [8] uses a ResNet+FPN BEV encoder to extract multi-scale BEV features, and BEVFormer uses multi-scale deformable attention to obtain multi-scale BEV features), whereas sparse-query-based methods cannot do this.

We believe that self-attention among the sparse queries can play the role of a BEV encoder, but the standard multi-head self-attention (MHSA) used in DETR3D has no multi-scale capability. We therefore propose a scale-adaptive self-attention (SASA) module that lets the model decide the appropriate receptive field on its own:

Attn(Q, K, V) = Softmax(QKᵀ/√d − τD)V, where D_ij is the Euclidean distance between the centers of query i and query j, and τ is the coefficient that controls the receptive field. As τ increases, the attention weights of distant queries decrease and the receptive field shrinks accordingly; when τ = 0, SASA degenerates into standard self-attention with a global receptive field. τ is generated adaptively from each query feature with a single linear layer, and each head generates its own value:

[τ₁, τ₂, …, τ_H] = Linear(q), where the linear layer maps the query feature q of dimension d to H values, H being the number of heads.
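Below is a minimal PyTorch sketch of the attention form described above (standard attention logits minus a per-query, per-head distance penalty τ·D). The class and attribute names, and details such as how τ is constrained, are illustrative assumptions rather than the repository's exact implementation.

```python
import torch
import torch.nn as nn

class ScaleAdaptiveSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.gen_tau = nn.Linear(dim, num_heads)  # per-query, per-head receptive-field coefficient
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, centers):
        # x: [B, N, C] query features; centers: [B, N, 2] BEV centers (x, y) of the queries
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)  # [B, H, N, hd]
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        dist = torch.cdist(centers, centers).unsqueeze(1)      # [B, 1, N, N] pairwise distances
        tau = self.gen_tau(x).permute(0, 2, 1).unsqueeze(-1)   # [B, H, N, 1]

        # Standard attention logits minus the distance penalty tau * D
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 - tau * dist
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```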

In our experiments, we observed two interesting phenomena:

  1. The τ values generated by the different heads are roughly uniformly distributed over a certain range, and this is independent of initialization. This indicates that SASA aggregates features at different scales in different heads, similar in spirit to FPN, and it further demonstrates, from a data-driven perspective, the necessity of multi-scale feature aggregation in BEV space. Moreover, compared with FPN, SASA's receptive field is more flexible and is learned freely from the data.

[Figure: distribution of the τ values generated by different heads]
  2. The τ values generated by queries for different object categories differ clearly. We found that the receptive field of queries for large objects (such as buses) is significantly larger than that of queries for small objects (such as pedestrians), as shown in the figure below. (Note: the larger the τ value, the smaller the receptive field.)

[Figure: τ values of queries corresponding to different object categories]

Compared with standard MHSA, SASA introduces almost no extra overhead and is simple and effective. In the ablation study, replacing MHSA with SASA directly brings gains of 4.0 mAP and 2.2 NDS:

[Table: ablation of SASA versus standard MHSA]

Adaptive Spatio-temporal Sampling

For each query, we use a single linear layer to generate a set of 3D offsets. We then transform these offsets with respect to the query pillar (using its center, size, and rotation) to obtain the 3D sampling points. The sampling-point generation process is illustrated below:

[Figure: adaptive sampling-point generation]

In this way, the generated sampling points adapt to the given query, which helps us better handle objects of different sizes and distances. Moreover, the sampling points are not restricted to the interior of the query bbox; they may even scatter outside the box, as decided by the model itself.
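The following is a minimal sketch of this per-query offset generation, assuming the query layout (x, y, z, w, l, h, yaw, vx, vy) used earlier; the class name, the linear-layer shape, and the omission of the yaw rotation are simplifications for illustration, not the repository's exact code.

```python
import torch
import torch.nn as nn

class AdaptiveSampling(nn.Module):
    def __init__(self, embed_dim=256, num_points=8):
        super().__init__()
        self.num_points = num_points
        self.gen_offsets = nn.Linear(embed_dim, num_points * 3)  # one 3D offset per sampling point

    def forward(self, query_feat, query_bbox):
        # query_feat: [B, Q, C]; query_bbox: [B, Q, 9] = (x, y, z, w, l, h, yaw, vx, vy)
        B, Q, _ = query_feat.shape
        offsets = self.gen_offsets(query_feat).view(B, Q, self.num_points, 3)

        center = query_bbox[..., None, 0:3]   # [B, Q, 1, 3]
        size = query_bbox[..., None, 3:6]     # [B, Q, 1, 3]
        # Scale the offsets by the pillar size and shift them to its center
        # (the paper additionally rotates by the query yaw; omitted here for brevity).
        points = center + offsets * size
        return points                          # [B, Q, P, 3] sampling points in ego coordinates
```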

Then, to capture long-term information, we warp the sampling points into the coordinate frames of earlier timestamps to achieve inter-frame alignment. In autonomous driving there are two kinds of motion: the motion of the ego vehicle (ego motion) and the motion of other objects (object motion). For ego motion, we use the ego pose provided by the dataset to align the frames; for object motion, we use the instantaneous velocity vector stored in the query together with a simple constant-velocity model to adaptively align moving objects. Both alignment operations bring gains:

[Table: ablation of ego-motion and object-motion alignment]
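Below is a minimal sketch of the two-step alignment described above (constant-velocity compensation for object motion, then an ego-pose transform into the past frame). The function signature, the pose-matrix convention, and the time-gap handling are illustrative assumptions.

```python
import torch

def warp_points(points, velocity, ego_cur_to_global, ego_global_to_past, dt):
    """Warp current-frame sampling points into a past frame.

    points:             [B, Q, P, 3] sampling points in the current ego frame
    velocity:           [B, Q, 2]    (vx, vy) carried by each query
    ego_cur_to_global:  [B, 4, 4]    current ego -> global homogeneous transform
    ego_global_to_past: [B, 4, 4]    global -> past ego homogeneous transform
    dt:                 time gap (seconds) between the current and the past frame
    """
    # Object motion: move points backwards along the (assumed constant) velocity.
    points = points.clone()
    points[..., :2] = points[..., :2] - velocity[:, :, None, :] * dt

    # Ego motion: current ego frame -> global frame -> past ego frame.
    ones = torch.ones_like(points[..., :1])
    homo = torch.cat([points, ones], dim=-1)            # [B, Q, P, 4]
    T = ego_global_to_past @ ego_cur_to_global          # [B, 4, 4]
    homo = torch.einsum('bij,bqpj->bqpi', T, homo)
    return homo[..., :3]
```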

Then we project the 3D sampling points onto the 2D images and obtain the features at the corresponding positions by bilinear interpolation. Here is a small engineering detail: since the input is six surround-view images, DETR3D projects each sampling point into all six views and averages the features from the valid projections. We found that most of the time only one projection is valid, and occasionally two (when a sampling point falls in the overlapping region of adjacent views). So we simply keep a single projection per point (even when two exist) and use its view ID as an extra coordinate axis, which lets us gather all features in one step with the 3D version of PyTorch's built-in grid_sample operator. This significantly improves speed with essentially no accuracy loss (as I recall, only about 0.1~0.2 NDS). For details, see the code: https://github.com/MCG-NJU/SparseBEV/blob/main/models/sparsebev_sampling.py
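The following sketch shows the idea of treating the view index as a third coordinate and doing the whole gather with a single 5D grid_sample call. The shapes, the [0, 1] normalization of the projected coordinates, and the function name are illustrative assumptions; see the linked file for the actual implementation.

```python
import torch
import torch.nn.functional as F

def sample_multiview_features(feats, points_2d, view_ids):
    """Gather features for projected points from six views with one grid_sample call.

    feats:     [B, N_views, C, H, W]  image features of the surround views
    points_2d: [B, P, 2]              projected (u, v), normalized to [0, 1]
    view_ids:  [B, P]                 index of the view each point is kept in
    returns:   [B, C, P]              sampled features
    """
    B, N, C, H, W = feats.shape
    # Treat the view axis as the depth dimension of a 3D volume: [B, C, N, H, W].
    volume = feats.permute(0, 2, 1, 3, 4)

    # Normalize (u, v, view_id) to [-1, 1] as grid_sample expects.
    u = points_2d[..., 0] * 2 - 1
    v = points_2d[..., 1] * 2 - 1
    w = view_ids.float() / (N - 1) * 2 - 1   # with align_corners=True this lands exactly on one view slice
    grid = torch.stack([u, v, w], dim=-1).view(B, 1, 1, -1, 3)

    sampled = F.grid_sample(volume, grid, align_corners=True)  # [B, C, 1, 1, P]
    return sampled.view(B, C, -1)
```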

For the sparse sampling we later also wrote a CUDA kernel based on Deformable DETR. The pure PyTorch implementation is already quite fast, though; the CUDA version adds roughly another 15% speedup.

We also provide a visualization of the sampling points (the first row is the current frame; the second and third rows are the two preceding frames). The sampling points of SparseBEV accurately capture objects of different scales in the scene (spatial adaptability) and also align well with objects moving at different speeds (temporal adaptability).

[Figure: visualization of sampling points on the current frame and the two preceding frames]

Adaptive Mixing

Next, we apply mixing [9] over the channel and point dimensions of the sampled features. Assuming T frames and P sampling points per frame, we first stack them into T×P sampling points. SparseBEV is therefore a stacking-style temporal scheme and can easily fuse information from future frames as well.

Next, we perform channel mixing on the features gathered from these sampling points, where the mixing weights are generated dynamically from the query feature q: M_c = Linear(q), F ← ReLU(LayerNorm(F·M_c)).

We then apply the same mixing operation along the point dimension: M_p = Linear(q), F ← ReLU(LayerNorm(M_p·F)).

Here M_c and M_p are the dynamic weights of channel mixing and point mixing, respectively. The former is shared across all frames and sampling points, and the latter is shared across all feature channels.
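A minimal AdaMixer-style sketch of the two mixing steps is given below, assuming the sampled features have already been stacked into T×P points per query; the class name, normalization placement, and the final flatten-and-project step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveMixing(nn.Module):
    def __init__(self, dim, num_points, query_dim):
        super().__init__()
        self.gen_channel = nn.Linear(query_dim, dim * dim)                # weights for channel mixing
        self.gen_point = nn.Linear(query_dim, num_points * num_points)    # weights for point mixing
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.out = nn.Linear(num_points * dim, query_dim)

    def forward(self, feats, query):
        # feats: [B, Q, T*P, C] stacked sampled features; query: [B, Q, query_dim]
        B, Q, P, C = feats.shape
        m_c = self.gen_channel(query).view(B, Q, C, C)   # dynamic channel-mixing matrix
        m_p = self.gen_point(query).view(B, Q, P, P)     # dynamic point-mixing matrix

        feats = torch.relu(self.norm1(feats @ m_c))      # mix along the channel dimension
        feats = torch.relu(self.norm2(m_p @ feats))      # mix along the point dimension
        return self.out(feats.flatten(2))                # project back to the query dimension
```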

Dual-branch SparseBEV

In experiments, we found that splitting the input multi-frame images into two branches, Fast and Slow, further improves performance [10]. Specifically, we divide the input into a high-resolution, low-frame-rate Slow branch and a low-resolution, high-frame-rate Fast branch. The Slow branch thus focuses on extracting high-resolution static details, while the Fast branch focuses on capturing motion information. The architecture of SparseBEV with the Dual-branch design is shown below:

[Figure: architecture of Dual-branch SparseBEV]

The Dual-branch design not only reduces training cost but also brings a clear performance improvement; see the supplementary material for details. The gain suggests that the static details and the motion information in long driving sequences should be decoupled. However, it makes the whole model more complex, so we do not use it by default (in this paper, only the test-set result with NDS = 63.6 uses it).

Experimental results

[Table: comparison with existing methods on the nuScenes validation set]

The table above compares SparseBEV with existing methods on the nuScenes validation set, where the marked methods use perspective pre-training. With a ResNet-50 backbone, 900 queries, and a 704x256 input resolution, SparseBEV surpasses the previous best method, SOLOFusion [4], by 0.5 mAP and 1.1 NDS. With nuImages pre-training and the number of queries reduced to 400, SparseBEV still maintains an inference speed of 23.5 FPS while reaching 55.8 NDS. After upgrading the backbone to ResNet-101 and increasing the input size to 1408x512, SparseBEV surpasses SOLOFusion by 1.8 mAP and 1.0 NDS.

nuScenes test split
[Table: comparison with existing methods on the nuScenes test set]

The table above compares SparseBEV with existing methods on the test set, where the marked methods use future frames. Without future frames, SparseBEV achieves 62.7 NDS and 54.3 mAP; its Dual-branch version further improves this to 63.6 NDS and 55.6 mAP. With future frames, SparseBEV surpasses BEVFormer V2 by 2.8 mAP and 2.2 NDS, while the V2-99 backbone we use has only about 70M parameters, far fewer than the InternImage-XL (over 300M parameters) used by BEVFormer V2.

Limitations

SparseBEV still has several weaknesses:

  1. SparseBEV relies heavily on the ego pose for inter-frame alignment. In Table 5 of the paper, removing ego-based warping drops NDS by about 10 points, which is almost the same as using no temporal information at all.

  2. SparseBEV uses stacking-style temporal modeling, whose cost grows linearly with the number of input frames. When the number of input frames is large (e.g., 16), inference slows down.

  3. The training recipe of SparseBEV is still the conventional one: in each training iteration, the DataLoader loads all frames. This places high demands on CPU throughput, so we use libraries such as TurboJPEG and Pillow-SIMD to speed up loading. All frames then go through the backbone, which also requires a fair amount of GPU memory. With ResNet-50 and 8-frame 704x256 input, a 2080Ti-11G still suffices; but with high resolution, future frames, and so on all enabled, only an A100-80G can run it. The training configurations in our open-source code are the minimum runnable ones. There are currently two remedies:

  • Truncate the gradients of part of the video frames. Our open-source config provides a stop_prev_grad option that runs all previous frames in no_grad mode, so that only the current frame back-propagates gradients (see the sketch after this list).

  • The other is to adopt the sequential training scheme used by SOLOFusion, StreamPETR, and other methods, which saves both memory and time. We may try it in the future.
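As a rough illustration of the stop_prev_grad idea, here is a hedged sketch that runs the backbone on history frames under no_grad and keeps gradients only for the current frame; the function name and loop structure are assumptions, not the actual config implementation.

```python
import torch

def extract_frame_features(backbone, frames, stop_prev_grad=True):
    """frames: list of image tensors ordered in time; frames[-1] is the current frame."""
    feats = []
    for i, img in enumerate(frames):
        if stop_prev_grad and i < len(frames) - 1:
            with torch.no_grad():          # history frames: no gradients, much less memory
                feats.append(backbone(img))
        else:
            feats.append(backbone(img))    # current frame keeps gradients
    return feats
```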

Conclusion

In this paper, we propose SparseBEV, a fully sparse single-stage 3D object detector. SparseBEV improves the adaptability of sparse-query models through three core modules, scale-adaptive self-attention, adaptive spatio-temporal sampling, and adaptive mixing, and achieves performance that matches or even exceeds dense-BEV-based methods. In addition, we propose a Dual-branch structure for more efficient long-sequence processing. SparseBEV achieves both high accuracy and high speed on nuScenes. We hope this work can shed some light on the sparse 3D detection paradigm.

[1] Wang Y, Guizilini V C, Zhang T, et al. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries[C]//Conference on Robot Learning. PMLR, 2022: 180-191.

[2] Liu Y, Wang T, Zhang X, et al. Petr: Position embedding transformation for multi-view 3d object detection[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 531-548.

[3] Li Z, Wang W, Li H, et al. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers[C]//European conference on computer vision. Cham: Springer Nature Switzerland, 2022: 1-18.

[4] Park J, Xu C, Yang S, et al. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection[J]. arXiv preprint arXiv:2210.02443, 2022.

[5] Zong Z, Jiang D, Song G, et al. Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction[J]. arXiv preprint arXiv:2304.00967, 2023.

[6] Wang S, Liu Y, Wang T, et al. Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection[J]. arXiv preprint arXiv:2303.11926, 2023.

[7] Yang C, Chen Y, Tian H, et al. BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 17830-17839.

[8] Huang J, Huang G, Zhu Z, et al. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view[J]. arXiv preprint arXiv:2112.11790, 2021.

[9] Gao Z, Wang L, Han B, et al. Adamixer: A fast-converging query-based object detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 5364-5373.

[10] Feichtenhofer C, Fan H, Malik J, et al. Slowfast networks for video recognition[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 6202-6211.

—END—
