Paper Quick Read -- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation


Summary

Multi-sensor fusion is critical for accurate and reliable autonomous driving. Recently proposed methods are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, which hurts effectiveness, especially for semantic-oriented tasks such as 3D scene segmentation. This paper breaks with that deep-rooted convention and proposes BEVFusion, a multi-task multi-sensor fusion framework that represents the multi-modal features in a unified BEV space, preserving both geometric and semantic information. It also optimizes BEV pooling by identifying and resolving the key efficiency bottleneck of the view transformation, reducing its latency by more than 40×. BEVFusion seamlessly supports different 3D perception tasks with almost no architectural changes: it achieves a 1.3% improvement in mAP and NDS for 3D object detection on nuScenes and a 13.6% improvement in mIoU for BEV map segmentation, at 1.9× lower computation cost.

1. Introduction

Different sensors provide complementary signals: for example, cameras capture rich semantic information, LiDAR provides precise spatial information, and mmWave radar provides instantaneous velocity estimates.

Because of the great success of 2D perception, a natural first approach is to project the LiDAR point cloud onto the camera image plane and process the resulting RGB-D data with 2D CNNs. However, this introduces severe geometric distortion and is less effective for geometry-oriented tasks such as 3D object recognition. Recent sensor fusion approaches take the opposite direction: they augment the LiDAR point cloud with semantic labels, CNN features, or virtual points derived from the 2D images, and then apply an existing LiDAR-based detector to predict 3D bounding boxes. This approach does not work well for semantic-oriented tasks such as BEV map segmentation.

Related work

LiDAR-based 3D perception. PointNet and SparseConvNet backbones; anchor-free heads; one-stage, two-stage, and other schemes.
Image-based 3D perception. Detection in the image (perspective) view; DETR-style detection heads with object queries in 3D space; LSS-style lifting schemes; multi-view methods; temporal schemes such as BEVDet4D, BEVFormer, and PETRv2; multi-head-attention-based schemes such as BEVFormer, CVT, and Ego3RT.

Multi-sensor fusion. Mainly divided into proposal-level and point-level fusion methods. MV3D projects 3D proposals onto the image and then extracts RoI features. Proposal-level fusion methods are object-centric and cannot be easily generalized to other tasks such as BEV map segmentation. Point-level fusion methods paint the LiDAR points with image semantic features and then run LiDAR-based detection; they are both object-centric and geometry-centric. PointPainting, PointAugmenting, MVP, FusionPainting, AutoAlign, and FocalSparseCNN decorate the LiDAR input level, while Deep Continuous Fusion and DeepFusion decorate the feature level.

Multi-task learning. Multi-task CNNs have made good progress in 2D computer vision, including joint object detection, instance segmentation, pose estimation, and human-object interaction. Recent concurrent works M2BEV, BEVFormer, and BEVerse perform 3D object detection and BEV segmentation jointly.


2. Method

Given the different perception inputs, modality-specific encoders first extract their features; the multi-modal features are then converted into a unified BEV representation that preserves both geometric and semantic information. The efficiency bottleneck of this view transformation is removed by precomputation and interval reduction, which accelerate BEV pooling. A convolution-based BEV encoder is then applied to the unified BEV features to alleviate local misalignment between the different features. Finally, task-specific heads are added to support different 3D scene understanding tasks.
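To make the data flow concrete, here is a minimal PyTorch-style skeleton of the pipeline described above. All class and argument names are placeholders introduced for illustration; they are not the authors' actual module names.

```python
import torch
import torch.nn as nn

class BEVFusionSketch(nn.Module):
    """Hypothetical skeleton of the pipeline described above (illustration only)."""

    def __init__(self, cam_encoder, lidar_encoder, cam_to_bev, bev_encoder, heads):
        super().__init__()
        self.cam_encoder = cam_encoder      # modality-specific image backbone (+ neck)
        self.lidar_encoder = lidar_encoder  # modality-specific point-cloud backbone, outputs BEV features
        self.cam_to_bev = cam_to_bev        # view transform with BEV pooling
        self.bev_encoder = bev_encoder      # convolutional BEV encoder (absorbs local misalignment)
        self.heads = nn.ModuleDict(heads)   # task-specific heads, e.g. {"det": ..., "seg": ...}

    def forward(self, images, points):
        cam_feats = self.cam_encoder(images)             # per-camera image features
        cam_bev = self.cam_to_bev(cam_feats)             # (B, C_cam, Y, X) in BEV
        lidar_bev = self.lidar_encoder(points)           # (B, C_lidar, Y, X) in BEV
        fused = torch.cat([cam_bev, lidar_bev], dim=1)   # fuse in the shared BEV space
        fused = self.bev_encoder(fused)
        return {name: head(fused) for name, head in self.heads.items()}
```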

2.1 Unified representation

Two conditions must be met by the shared representation: 1) features from all sensors can be converted into it easily and without information loss; 2) it is suitable for different types of tasks.
LiDAR to camera projection. Projecting the LiDAR point cloud onto the RGB image yields a 2.5D sparse depth map, but this conversion loses geometric information (two pixels that are adjacent in the image may be far apart in 3D space), which is unfavorable for 3D object detection.
Camera to LiDAR projection. Most fusion algorithms decorate the LiDAR point cloud with corresponding image features, but semantic information is lost this way: fewer than 5% of the image features end up matched to a point. Object-query-based fusion suffers from the same drawback.
BEV perspective. Friendly to almost all perception tasks, since the output space is also BEV. More importantly, the conversion to BEV keeps both the geometric structure (from LiDAR features) and the semantic density (from camera features).

2.2 Efficient image-to-BEV conversion

The camera-to-BEV transformation is non-trivial because the depth associated with each camera feature pixel is inherently ambiguous. Following LSS and BEVDet, a discrete depth distribution is explicitly predicted for each pixel, turning the camera features into a feature point cloud of size N×H×W×D (N is the number of cameras, H×W the feature map size, and D the number of discrete depth bins). The BEV pooling operator then aggregates all features that fall into the same BEV grid cell and flattens them along the z axis. BEV pooling is optimized through precomputation and interval reduction.
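The lifting step can be sketched as follows: each camera feature pixel predicts a categorical distribution over D depth bins, and its feature is scattered along the camera ray with those weights. This is a minimal sketch in the spirit of LSS/BEVDet; the tensor shapes and function name are assumptions, not the paper's code.

```python
import torch

def lift_camera_features(feat, depth_logits):
    """Lift camera features into a frustum point cloud, LSS/BEVDet style (sketch).

    feat:         (N, C, H, W) image features from N cameras
    depth_logits: (N, D, H, W) per-pixel logits over D discrete depth bins
    returns:      (N, D, H, W, C) features weighted by the predicted depth distribution
    """
    depth_prob = depth_logits.softmax(dim=1)  # categorical depth distribution per pixel
    # outer product over the depth axis: every pixel contributes D weighted copies of its feature
    lifted = depth_prob.unsqueeze(-1) * feat.permute(0, 2, 3, 1).unsqueeze(1)
    return lifted  # one feature vector per (camera, depth bin, pixel) frustum point

# usage with made-up sizes: 6 cameras, 80 channels, 32x88 feature map, 118 depth bins
lifted = lift_camera_features(torch.randn(6, 80, 32, 88), torch.randn(6, 118, 32, 88))
```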

Precomputation. The first step is to associate each point of the camera feature point cloud with a BEV grid cell. Because the coordinates of the camera feature point cloud are fixed (as long as the camera intrinsics and extrinsics stay the same, which is usually the case after proper calibration), the 3D coordinate and BEV grid index of every point can be precomputed. All points are sorted by grid index and the rank of each point is recorded; at inference time, the feature points only need to be reordered according to the precomputed ranks. This caching mechanism reduces the latency of grid association from 17 ms to 4 ms.
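A minimal sketch of this precomputation, assuming the 3D coordinates of the frustum points have already been derived from the camera intrinsics/extrinsics; the function name, argument layout, and grid parameters are illustrative assumptions.

```python
import torch

def precompute_bev_ranks(coords_xyz, bev_min, voxel_size, bev_shape):
    """Precompute, once per calibration, the BEV cell index of every frustum point
    and the sorting order ("ranks") used to reorder features at inference (sketch).

    coords_xyz: (P, 3) 3D coordinates of all camera frustum points (fixed given calibration)
    bev_min:    (3,)   minimum x/y/z corner of the BEV grid
    voxel_size: (3,)   size of each BEV cell
    bev_shape:  (nx, ny) number of BEV cells along x and y
    """
    idx = ((coords_xyz - bev_min) / voxel_size).floor().long()   # per-point cell index
    nx, ny = bev_shape
    keep = (idx[:, 0] >= 0) & (idx[:, 0] < nx) & (idx[:, 1] >= 0) & (idx[:, 1] < ny)
    flat = (idx[:, 0] * ny + idx[:, 1])[keep]                    # flatten (x, y) -> single cell id
    order = flat.argsort()                                       # points of the same cell become contiguous
    return keep, order, flat[order]                              # cached and reused for every frame
```

At inference, the per-frame camera features are simply masked with `keep` and gathered with `order`, with no per-frame sorting.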

Interval reduction. After grid association, all points belonging to the same BEV grid cell are contiguous in the tensor. The next step of BEV pooling is to aggregate the features within each BEV cell with a symmetric function (e.g., mean, max, or sum). Existing implementations first compute the prefix sum over all points and then subtract the values at the boundaries where the cell index changes. However, the prefix sum requires a tree reduction on the GPU and produces many unused partial sums (only the boundary values are needed), both of which are inefficient. To accelerate feature aggregation, the paper implements a dedicated GPU kernel that parallelizes directly over BEV cells: each cell is assigned a GPU thread that computes its interval sum and writes the result back. This kernel removes the dependency between outputs (no multi-level tree reduction is needed) and avoids writing partial sums to DRAM, reducing the latency of feature aggregation from 500 ms to 2 ms.
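To contrast the two aggregation strategies, here is a sketch in plain PyTorch: the first function mimics the prefix-sum baseline, the second conveys the idea of accumulating each cell's sum directly. The real speedup comes from the dedicated CUDA kernel in the authors' implementation; `index_add_` here only stands in for it.

```python
import torch

def bev_pool_prefix_sum(feats, cell_ids):
    """Baseline: aggregate sorted point features per BEV cell via a prefix sum (sketch).

    feats:    (P, C) frustum point features, already sorted by cell id
    cell_ids: (P,)   sorted BEV cell index (long) of each point
    """
    cumsum = feats.cumsum(dim=0)                      # (P, C) running sum, mostly thrown away
    last = torch.ones_like(cell_ids, dtype=torch.bool)
    last[:-1] = cell_ids[:-1] != cell_ids[1:]         # last point of each cell
    sums = cumsum[last]
    sums[1:] = sums[1:] - sums[:-1]                   # subtract the previous boundary value
    return cell_ids[last], sums

def bev_pool_interval_sum(feats, cell_ids, num_cells):
    """Idea behind the dedicated kernel: compute each cell's sum directly, without
    materializing a full prefix sum (emulated here with index_add_, not real CUDA)."""
    out = feats.new_zeros(num_cells, feats.shape[1])
    out.index_add_(0, cell_ids, feats)                # one accumulation per cell
    return out
```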

2.3 Convolutional Fusion

After all sensor features have been converted into the shared BEV representation, they can easily be fused (e.g., by concatenation). However, even in this unified space, the LiDAR BEV features and camera BEV features can still be spatially misaligned because of inaccurate depth during the view transformation. A convolution-based BEV encoder (with a few residual blocks) is therefore applied to compensate for these local misalignments.
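A minimal sketch of this fusion step, assuming a small residual-block BEV encoder; the channel counts and number of blocks are made-up defaults, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block standing in for the BEV encoder's blocks (assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class BEVFuser(nn.Module):
    """Concat camera/LiDAR BEV maps, then let convolutions absorb local misalignment (sketch)."""
    def __init__(self, cam_channels=80, lidar_channels=256, out_channels=256, num_blocks=2):
        super().__init__()
        self.reduce = nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(out_channels) for _ in range(num_blocks)])

    def forward(self, cam_bev, lidar_bev):
        x = torch.cat([cam_bev, lidar_bev], dim=1)   # (B, C_cam + C_lidar, Y, X)
        return self.blocks(self.reduce(x))
```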

2.4 Multi-task heads

Detection head. Similar to CenterPoint.
Segmentation head. Since map categories may overlap, segmentation is treated as multiple binary segmentation tasks trained with focal loss (see the sketch below).
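A minimal sketch of how such a segmentation head can be trained, treating the K map classes as independent binary masks under a sigmoid focal loss; the alpha/gamma values and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_class_binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """BEV map segmentation as K independent binary tasks with focal loss (sketch).

    logits:  (B, K, Y, X) one channel per map class (classes may overlap)
    targets: (B, K, Y, X) binary masks in {0, 1}
    alpha/gamma are standard focal-loss hyperparameters (values here are assumptions).
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * bce             # down-weight easy pixels
    return loss.mean()
```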

3. Results



4. Experiment

4.1 Model:
Image backbone: Swin Transformer; LiDAR backbone: VoxelNet;
FPN fuses multi-scale image features into a map at 1/8 of the input size;
images are downsampled to 256×704; the point cloud is voxelized at 0.075 m for detection and 0.1 m for segmentation;
a grid sampler resamples the BEV features to the different resolutions required by each task head (see the sketch below).
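A minimal sketch of such a grid sampler using `torch.nn.functional.grid_sample`, resampling the shared BEV map to whatever resolution a head expects; the normalized-coordinate convention and sizes here are assumptions.

```python
import torch
import torch.nn.functional as F

def resample_bev(bev, out_hw):
    """Resample a shared BEV feature map to the resolution a given task head expects (sketch).

    bev:    (B, C, Y, X) fused BEV features
    out_hw: (out_y, out_x) target grid size, e.g. different for detection vs. segmentation
    """
    b = bev.shape[0]
    out_y, out_x = out_hw
    ys = torch.linspace(-1.0, 1.0, out_y, device=bev.device)
    xs = torch.linspace(-1.0, 1.0, out_x, device=bev.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # (B, out_y, out_x, 2)
    return F.grid_sample(bev, grid, mode="bilinear", align_corners=True)

# usage with made-up sizes: resample a 360x360 BEV map to a 200x200 segmentation grid
seg_in = resample_bev(torch.randn(1, 256, 360, 360), (200, 200))
```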

4.2 Training:
The image encoder is not frozen; the whole model is trained jointly, with data augmentation applied to both the images and the LiDAR point cloud.

4.3 Dataset:
nuScenes

5. Analysis

Weather and lighting. LiDAR returns are noisy in rain, while the camera is more robust across different weather conditions; BEVFusion's performance is 10.7 mAP higher than CenterPoint's. For segmentation, even the camera-only variant of BEVFusion performs better than CenterPoint, although the camera performs poorly at night.

Size and distance. BEVFusion brings consistent performance gains for objects of different sizes, with slightly larger gains for small or distant objects, which have fewer LiDAR points but benefit from dense image information.

Sparse LiDAR. BEVFusion is not a point-level fusion method, so it does not need to rely heavily on dense LiDAR input.

Multi-task learning. Jointly training different tasks can hurt each individual task, which is commonly referred to as "negative transfer". Using a separate BEV encoder can alleviate this problem to some extent.

Ablation studies.


Source: blog.csdn.net/weixin_36354875/article/details/131289553