Fast-BEV: real-time bird's-eye view perception

mmdetection3d

Install mmcv-full

pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu101/torch1.8.1/index.html

Install mmdet

pip install mmdet==2.14.0

Install mmsegmentation

pip install mmsegmentation==0.14.1

If you try to compile mmdet3d at this point, the build may fail because of a gcc version problem.

Check the version with gcc -v or gcc --version. gcc 7.5.0 requires at least CUDA 11; on CUDA 10 you must downgrade gcc to 5.5. Compiling with gcc 7.5 was tested here and fails.

Downgrade gcc

Install gcc 5.5

sudo apt install gcc-5 g++-5

Set up the symlink via update-alternatives

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 80 --slave /usr/bin/g++ g++ /usr/bin/g++-5

Run gcc -v and g++ -v again and check that the gcc and g++ versions are now 5.5.

To switch back to gcc 7.5.0 (or between multiple installed versions):

sudo update-alternatives --config gcc

then select the desired version by its number at the prompt.

Compile mmdet3d

cd mmdetection3d
pip install -v -e .

mmdetection3d RuntimeError: Error compiling objects for extension

Solution:

The cause is an incorrectly written CUDA_HOME export, so the build cannot locate the CUDA installation. Correct the export so that CUDA_HOME points to the CUDA install path (typically /usr/local/cuda), and the setup step will then proceed normally.
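A quick way to confirm the build will see CUDA is to check what PyTorch resolves CUDA_HOME to; a minimal diagnostic sketch, assuming PyTorch is already installed:

import os
from torch.utils.cpp_extension import CUDA_HOME  # the value the extension build uses to locate nvcc

# Both should point at a valid CUDA installation, e.g. /usr/local/cuda
print("CUDA_HOME environment variable:", os.environ.get("CUDA_HOME"))
print("CUDA_HOME resolved by PyTorch:", CUDA_HOME)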

Take a look at the paper "Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception", which finds that a BEV representation can achieve strong performance without an expensive view transformation or depth representation. The paper uses M2BEV as its baseline and further introduces (1) a strong data augmentation strategy in both image and BEV space to avoid overfitting, (2) a multi-frame feature fusion mechanism to exploit temporal information, and (3) an optimized, deployment-friendly view transformation to speed up inference. The paper reports that the M1 model (R18@256×704) runs at 50 FPS on the Tesla T4 platform. The code has also been open-sourced on GitHub.

paper: https://arxiv.org/pdf/2301.07870.pdf

code: https://github.com/sense-gvt/fast-bev

1. Main contributions

An accurate 3D perception system is crucial for autonomous driving, and pure-vision BEV approaches can replace the traditional, expensive lidar. This paper builds on the idea of M2BEV, which assumes a uniform depth distribution along each camera ray during the image-to-BEV view transformation, and proposes a stronger and faster fully convolutional BEV perception framework that needs neither an expensive view transformer nor a depth representation. The main contributions are as follows:

    We verify the effectiveness of two techniques on M2BEV: strong data augmentation and multi-frame temporal fusion, enabling Fast-BEV to achieve state-of-the-art performance.

    We propose two acceleration designs: precomputing the projection indices and projecting all cameras onto the same dense voxel feature, enabling Fast-BEV to be easily deployed on on-board chips with fast inference speed.

    The Fast-BEV proposed in this paper is the first deployment-oriented work on real-time in-vehicle BEV perception. We hope our work can shed light on industrial-grade, real-time, in-vehicle BEV perception.

2. Details

The figure below is a schematic of the entire framework. The model is largely built on M2BEV, but this paper improves on several of M2BEV's shortcomings; let's go through them in detail below.

  • 1. Lightweight view transformation from 2D image features to the 3D voxel space
  • 2. A multi-scale image encoder and a BEV encoder that strengthen the multi-scale and channel information of the feature maps
  • 3. Data augmentation on both the images and the surround-view BEV space to avoid overfitting
  • 4. A multi-frame feature fusion mechanism
  • Trained on the nuScenes dataset on a single RTX 2080 Ti, FPS reaches 52.6 and NDS reaches 0.473.

The overall pipeline of this article:

[Figure: the overall Fast-BEV pipeline]

  • 1) Image encoding part: outputs multi-scale image features
  • 2) 2D-to-3D part: uses a precomputed lookup table to project 2D features along the camera rays into a unified 3D voxel space
  • 3) Data augmentation: augmentation in both image and BEV space, implemented in practice by adjusting the camera intrinsics and extrinsics accordingly (see the sketch after this list)
  • 4) Temporal data fusion: follows the temporal fusion strategy of BEVDet4D and increases the number of frames involved in the fusion
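To make point 3 concrete, here is a minimal sketch of folding a BEV-space augmentation (rotation/flip) into the camera extrinsics; the matrix shapes, names, and the specific transform are illustrative assumptions, not the paper's exact implementation:

import numpy as np

def augment_extrinsic(cam_to_ego, angle_deg, flip_x=False):
    """Apply a BEV-space rotation/flip by adjusting the 4x4 camera-to-ego extrinsic."""
    theta = np.deg2rad(angle_deg)
    rot = np.eye(4)
    rot[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]]
    flip = np.eye(4)
    if flip_x:
        flip[0, 0] = -1.0            # mirror the ego x-axis
    bev_aug = flip @ rot             # augmentation expressed in ego/BEV space
    # The same bev_aug transform must also be applied to the ground-truth boxes.
    return bev_aug @ cam_to_ego

# Example: rotate the whole scene by 10 degrees around the ego z-axis
print(augment_extrinsic(np.eye(4), angle_deg=10.0))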

M2BEV architecture:

On top of the M2BEV architecture, a powerful data augmentation is added to avoid overfitting, and a multi-frame feature fusion mechanism is added to exploit temporal information, achieving state-of-the-art performance. In addition, the 2D-to-3D view transformation is optimized to make it more suitable for deployment on vehicular platforms.

[Figure: M2BEV architecture]

Fast-BEV:

  • Data augmentation. We empirically found a severe overfitting problem in the later stages of training M2BEV, because no data augmentation was used in the original M2BEV. Inspired by recent work [18, 37], we add strong 3D augmentations, such as random flips and rotations, in both the image and BEV spaces. See Section 3.3 for details.
  • Temporal fusion. In practical autonomous driving scenarios, the inputs are temporally continuous and carry huge complementary information. For example, a pedestrian who is partially occluded in the current frame may have been fully visible several frames earlier. Therefore, we extend M2BEV from space-only to space-time by introducing a temporal feature fusion module, similar to [31, 20]. More specifically, we train Fast-BEV in an end-to-end manner using the current-frame BEV features and stored historical-frame features as input.
  • Optimized view transformation. We find that the projection from image space to voxel space dominates the latency. We propose to optimize the projection from two angles: (1) we precompute a fixed projection index and store it as a static lookup table, which is very efficient during inference; (2) we let all cameras project into the same voxel feature to avoid expensive voxel aggregation. Unlike methods built on the Lift-Splat-Shoot view transformation [37, 18, 1], our approach needs no complex, hard-to-deploy DSP/GPU parallel computation and is fast enough using only CPU computation, which makes deployment very convenient (see the sketch after this list). See Section 3.5 for more details.
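To make the lookup-table idea concrete, here is a minimal sketch of how such a projection table could be precomputed offline and reused at inference time; the shapes, names, and nearest-pixel sampling are illustrative assumptions rather than the authors' exact code:

import numpy as np

def build_lut(voxel_centers, intrinsic, ego_to_cam, img_h, img_w):
    """Precompute, for every voxel, the flat image-pixel index it projects to (-1 if none)."""
    n = len(voxel_centers)
    pts = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)   # (n, 4) homogeneous ego coords
    cam_pts = (ego_to_cam @ pts.T).T[:, :3]                          # ego -> camera coordinates
    uvz = (intrinsic @ cam_pts.T).T                                  # camera -> pixel coordinates
    lut = np.full(n, -1, dtype=np.int64)
    front = uvz[:, 2] > 0                                            # keep voxels in front of the camera
    u = np.round(uvz[front, 0] / uvz[front, 2]).astype(np.int64)
    v = np.round(uvz[front, 1] / uvz[front, 2]).astype(np.int64)
    inside = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    lut[np.where(front)[0][inside]] = v[inside] * img_w + u[inside]  # flat pixel index per voxel
    return lut

def lut_view_transform(img_feat, lut):
    """At inference, scatter 2D features into the voxel volume with a pure table lookup."""
    c, h, w = img_feat.shape
    flat = img_feat.reshape(c, h * w)
    vox = np.zeros((c, len(lut)), dtype=img_feat.dtype)
    hit = lut >= 0
    vox[:, hit] = flat[:, lut[hit]]       # all cameras can write into this same voxel tensor
    return vox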

3. Temporal Fusion

Inspired by BEVDet4D and BEVFormer, the authors also introduce historical frames into the current frame for temporal feature fusion. Through a spatial alignment operation followed by concatenation, the features of the historical frames are fused with the corresponding features of the current frame. Temporal fusion can be seen as frame-level feature enhancement, and a longer time window (within a certain range) brings larger performance gains.

Specifically, the current frame is combined with three sampled historical keyframes at an interval of 0.5 s each, and this paper adopts the multi-frame feature alignment method of BEVDet4D. As shown in Figure 6, after the four aligned BEV features are obtained, they are directly concatenated and fed to the BEV encoder. In the training phase, the image encoder extracts the historical-frame features online; in the testing phase, the historical-frame features can be cached offline and directly retrieved for acceleration. Compared with BEVDet4D and BEVFormer: BEVDet4D introduces only one historical frame, which we believe is insufficient to exploit historical information.
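A minimal sketch of this "align then concatenate" fusion; the tensor shapes, the identity alignment, and the 1x1-conv stand-in for the BEV encoder are placeholders, not the actual implementation:

import torch
import torch.nn as nn

def temporal_fusion(curr_bev, hist_bevs, align_fn, bev_encoder):
    """curr_bev: (B, C, X, Y); hist_bevs: list of historical BEV features of the same shape."""
    aligned = [align_fn(f) for f in hist_bevs]        # warp history into the current ego frame
    fused = torch.cat([curr_bev] + aligned, dim=1)    # concatenate along channels: (B, 4*C, X, Y)
    return bev_encoder(fused)

# Toy usage: 3 historical frames, identity alignment, 1x1 conv as a stand-in BEV encoder
B, C, X, Y = 1, 64, 200, 200
encoder = nn.Conv2d(4 * C, C, kernel_size=1)
out = temporal_fusion(torch.randn(B, C, X, Y),
                      [torch.randn(B, C, X, Y) for _ in range(3)],
                      align_fn=lambda f: f,
                      bev_encoder=encoder)
print(out.shape)    # torch.Size([1, 64, 200, 200])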

Fast-BEV uses three historical frames, which significantly improves performance, while BEVFormer only slightly outperforms BEVDet4D with its two historical frames. Moreover, due to memory constraints, BEVFormer detaches the historical features (no gradients) during training, which is not optimal, and it fuses features sequentially in an RNN style, which is inefficient. In contrast, all frames in Fast-BEV are trained in an end-to-end manner, which is easier to train on ordinary GPUs.

4. Ablation Experiment

Dataset description: Fast BEV is evaluated on the nuScenes dataset, which contains 1000 autonomous driving scenes of 20 seconds each. The dataset is split into 850 scenes for training/validation and the remaining 150 scenes for testing. While the nuScenes dataset provides data from different sensors, we only use camera data. The camera has six views: front left, front, front right, back left, back, back right.

Evaluation metrics. To comprehensively evaluate the detection task, the standard mean Average Precision (mAP) and the nuScenes Detection Score (NDS) are used for 3D object detection evaluation. In addition, to measure the accuracy of the corresponding aspects (translation, scale, orientation, velocity, and attributes), the mean translation error (mATE), mean scale error (mASE), mean orientation error (mAOE), mean velocity error (mAVE), and mean attribute error (mAAE) are used as metrics.
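For reference, a small sketch of how NDS combines mAP with the five true-positive error metrics, following the standard nuScenes definition (the numbers below are made-up placeholders):

def nds(map_score, tp_errors):
    """NDS = (1/10) * (5 * mAP + sum over the 5 TP metrics of (1 - min(1, error)))."""
    return (5 * map_score + sum(1 - min(1.0, e) for e in tp_errors)) / 10.0

# Placeholder values for mAP and [mATE, mASE, mAOE, mAVE, mAAE]
print(nds(0.35, [0.6, 0.3, 0.4, 0.5, 0.2]))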

5. Scalability in practice

With the development of the technology, many autonomous-driving manufacturers have begun to abandon lidar and rely on pure cameras for perception. As a result, the large amount of data collected from production vehicles contains no depth information. In practical development, the model or the data is usually scaled up on this real-vehicle data to exploit its potential and improve performance. In this setting, depth-supervised solutions hit a bottleneck, whereas Fast-BEV does not introduce any depth information and can therefore be applied more readily.


Origin: blog.csdn.net/weixin_64043217/article/details/129102103