Fully Sparse Fusion for 3D Object Detection

Abstract

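Most current 3D object detection heads are built on LiDAR BEV (bird's-eye-view) representations. The BEV feature map is dense, so its cost grows quadratically with the perception range. For LiDAR-only detection there are already many sparse methods (e.g. FSD, VoxelNeXt, FlatFormer) that suit long-range scenes. This paper extends the sparse approach to camera-LiDAR fusion. Specifically, it runs 2D instance segmentation on the images and uses the resulting instances as queries. It reports SOTA results on nuScenes and Argoverse 2 and runs 2.7x faster than other 3D multi-modal frameworks.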
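
To make the quadratic growth concrete, the short sketch below counts the cells of a square dense BEV grid. The ranges and the 0.5 m cell size are illustrative assumptions, not values taken from the paper.

```python
def dense_bev_cells(range_m: float, cell_m: float) -> int:
    """Number of cells in a square BEV grid covering [-range_m, range_m] on both axes."""
    side = round(2 * range_m / cell_m)
    return side * side


short_range = dense_bev_cells(50.0, 0.5)    # 200 x 200 grid
long_range = dense_bev_cells(200.0, 0.5)    # 800 x 800 grid
# Quadrupling the range multiplies the number of dense cells by 16.
print(short_range, long_range, long_range // short_range)
```

A sparse detector only touches occupied voxels or points, so its cost follows the amount of data rather than the area of the grid.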

1. Introduction

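Advantages of multi-modality:

LiDAR: accurate localization, but small or distant objects are hard to recognize;

Camera: rich semantic information, but no depth information;

Fused networks perform better and are more robust (BEVFusion, BEVFusion, TransFusion).

Quite a few sparse networks already exist (Fully Sparse Detector (FSD), VoxelNeXt, FlatFormer, Super Sparse Detection (SSD)).

This paper focuses on how to bring in 2D image fusion while keeping the structure sparse.

The paper's multi-modal approach: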

In particular, we take the freebie from the well-studied 2D instance segmentation. The 2D instance masks are viewed as 2D queries, which are lifted to 3D space by gathering the corresponding point cloud in the frustum. In this way, these queries generated from images can be aligned with the queries from the LiDAR side, establishing a unified multi-modal input for further processing.

2. Related Work

2.1 camera based

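BEV-based methods:

LSS family: LSS, BEVDet4D, BEVDepth

LSS (Lift, Splat, Shoot): a long-form walkthrough of the paper and source code - Zhihu

The 3D detection paradigm in the BEV view: BEVDet paper and source code walkthrough - Zhihu

For each pixel, the network predicts a set of candidate depths and a weight for each depth. Using the camera intrinsics and extrinsics, the image features are projected into 3D pillars from their pixel coordinates and depths; bev_pool then aggregates them into BEV features for the downstream tasks.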
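
The lift and splat steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the LSS implementation: the function names, the tuple layout of `pixels`, and the choice to stay in the camera frame (skipping the extrinsics) are all assumptions made for brevity.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at `depth` into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def lift_splat(pixels, depth_bins, intrinsics, cell_m):
    """pixels: list of (u, v, feature, depth_logits).

    Returns a sparse dict {(x_cell, z_cell): summed feature}, i.e. a sum-pooled BEV map.
    Assumption: x (right) and z (forward) of the camera frame span the BEV plane.
    """
    fx, fy, cx, cy = intrinsics
    bev = {}
    for u, v, feature, depth_logits in pixels:
        # Lift: one weighted copy of the pixel feature per candidate depth.
        for depth, weight in zip(depth_bins, softmax(depth_logits)):
            x, _, z = unproject(u, v, depth, fx, fy, cx, cy)
            cell = (math.floor(x / cell_m), math.floor(z / cell_m))
            # Splat: accumulate the weighted feature into its BEV cell.
            acc = bev.setdefault(cell, [0.0] * len(feature))
            for i, f in enumerate(feature):
                acc[i] += weight * f
    return bev


bev = lift_splat(
    pixels=[(100.0, 50.0, [1.0, 2.0], [0.0, 0.0])],
    depth_bins=[5.0, 15.0],
    intrinsics=(100.0, 100.0, 100.0, 50.0),
    cell_m=10.0,
)
print(bev)
```

With equal depth logits, the single pixel feature is split evenly between the two cells that its candidate depths land in.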

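Transformer: BEVFormer, FrustumFormer, BEVFormer v2

BEVFormer

A long-form guide to the camera-only perception algorithm BEVFormer - Zhihu

In the BEV plane, a fixed number of points at fixed heights is generated inside each pillar. Using the camera intrinsics and extrinsics, these points are projected onto the image feature plane as anchor points, and an attention mechanism extracts multi-scale image features around them to build the BEV features.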
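
As a rough sketch of the reference-point step only (the attention itself is omitted), the code below places points at fixed heights in one pillar and projects them into the image. The function names and the assumption that the pillar is already expressed in the camera frame, with y as the vertical axis, are illustrative and not taken from BEVFormer's code.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def pillar_reference_points(x, z, heights, intrinsics, image_size):
    """Project the points (x, h, z) for each height h; keep those inside the image."""
    fx, fy, cx, cy = intrinsics
    width, height = image_size
    hits = []
    for h in heights:
        uv = project((x, h, z), fx, fy, cx, cy)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            hits.append(uv)
    return hits


anchors = pillar_reference_points(
    x=0.0, z=10.0, heights=[-1.0, 0.0, 1.0],
    intrinsics=(100.0, 100.0, 50.0, 50.0), image_size=(100, 100),
)
print(anchors)
```

Each returned pixel location is where the attention module would sample image features for that pillar; a pillar with no valid projections simply contributes nothing from this camera.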

Query-based methods: DETR3D, PETR, Object as Query

DETR3D

2.2 lidar based

Fully dense: PointPillars, VoxelNet

Partially sparse: SECOND, CenterPoint

Fully sparse: FSD, SSD, VoxelNeXt, FlatFormer

FSD:

A fully sparse 3D object detector - Zhihu

2.3 multi modal 3D detection

Dense BEV: BEVFusion, which fuses features at the BEV level;

Point-level: PointPainting, PointAugmenting, which fuse semantics and features per point;

Instance-level: CLOCs, DeepFusion, Pixel-instance, which fuse at the instance level;

3. Preliminary

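This paper uses FSD (Fully Sparse 3D Detector) as its baseline.

3D instance segmentation

1. Extract sparse voxel features.

2. Map the voxel features back onto the points and concatenate each point's offset to its voxel, forming the point features.

3. Feed the point features into two heads: one classifies points as foreground or background, the other votes for the object center (as in VoteNet) by predicting the offset from each foreground point to its object center.

4. For the foreground points that are kept, predict the centers and group points whose predicted centers are close together.

5. Obtain the 3D instances and associate them with the voted centers to form clusters.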
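
Steps 3 to 5 can be sketched as voting followed by distance-based grouping. The code below is a minimal illustration assuming connected-components grouping with a fixed radius; the function names, the 0.5 foreground threshold, and the radius are hypothetical, and the O(n^2) neighbor search is only acceptable for a toy example.

```python
import math


def vote_centers(points, offsets, fg_scores, fg_thresh=0.5):
    """Keep foreground points and shift each one by its predicted center offset."""
    return [
        tuple(p + o for p, o in zip(point, offset))
        for point, offset, score in zip(points, offsets, fg_scores)
        if score > fg_thresh
    ]


def group_by_distance(centers, radius):
    """Connected components: two votes are linked when closer than `radius`."""
    labels = [-1] * len(centers)
    next_label = 0
    for seed in range(len(centers)):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(len(centers)):
                if labels[j] == -1 and math.dist(centers[i], centers[j]) < radius:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 5.0, 0.0)]
offsets = [(0.5, 0.0, 0.0), (-0.5, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
scores = [0.9, 0.8, 0.7, 0.1]   # the last point is treated as background

votes = vote_centers(points, offsets, scores)
print(group_by_distance(votes, radius=1.0))
```

The first two points vote for the same center and end up in one instance, the third forms its own instance, and the background point never votes.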

sparse prediction

Each group has a group center, pair-wise features, and group feature aggregation.

From the point-cloud features of each instance, an MLP predicts the 3D box.

4. Methodology

4.1 overall architecture

4.2 query generation

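LiDAR queries: the 3D instances that FSD produces as an intermediate result.

Camera queries: each instance mask defines an initial frustum, which is then cropped to obtain a point-cloud cluster.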
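
The frustum step for camera queries amounts to projecting the LiDAR points into the image and keeping those that land inside the instance mask. The sketch below shows only that gathering step, under the assumption that points are already in the camera frame and the mask is a nested list of booleans; the later cropping that removes background points inside the frustum is not shown.

```python
import math


def frustum_point_indices(points_cam, mask, intrinsics):
    """Indices of points whose image projection falls on a True pixel of `mask`.

    points_cam: list of (x, y, z) in the camera frame (assumption).
    mask: mask[v][u] is True where the 2D instance covers pixel (u, v).
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(mask), len(mask[0])
    selected = []
    for idx, (x, y, z) in enumerate(points_cam):
        if z <= 0:          # behind the camera, cannot be in the frustum
            continue
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height and mask[v][u]:
            selected.append(idx)
    return selected


mask = [[False] * 4 for _ in range(4)]
mask[1][2] = True   # a one-pixel "instance" at (u=2, v=1)

points = [(0.5, 0.5, 1.0), (-1.5, 0.0, 1.0), (0.5, 0.5, -1.0)]
print(frustum_point_indices(points, mask, intrinsics=(1.0, 1.0, 2.0, 1.0)))
```

Because a frustum extends along the whole viewing ray, the gathered points can include background behind or in front of the object, which is why a further crop is needed before the cluster is used as a query.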

4.3 Bi-modal Query Refinement

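Each modality's queries go through a VFE to produce reference boxes, so both branches yield results in the same format;

The reference boxes are then used to crop points, and box features are extracted from the cropped points for the final prediction;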
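
Cropping points with a reference box is a point-in-rotated-box test. The sketch below assumes a box layout of (cx, cy, cz, w, l, h, yaw) with the length l along the heading direction; that layout and the function name are assumptions for illustration, not the paper's code.

```python
import math


def crop_points_in_box(points, box):
    """Return indices of points inside a yaw-rotated 3D box."""
    cx, cy, cz, w, l, h, yaw = box
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    inside = []
    for idx, (x, y, z) in enumerate(points):
        dx, dy = x - cx, y - cy
        # Rotate the offset into the box frame: `along` follows the heading.
        along = cos_y * dx + sin_y * dy
        across = -sin_y * dx + cos_y * dy
        if abs(along) <= l / 2 and abs(across) <= w / 2 and abs(z - cz) <= h / 2:
            inside.append(idx)
    return inside


# A 2 m wide, 4 m long box heading along +y (yaw = 90 degrees).
box = (0.0, 0.0, 0.0, 2.0, 4.0, 2.0, math.pi / 2)
points = [(0.0, 1.5, 0.0), (1.5, 0.0, 0.0), (0.0, 0.0, 3.0)]
print(crop_points_in_box(points, box))
```

Since LiDAR and camera queries both end up as reference boxes in the same format, the same crop can serve both branches before box features are extracted.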

4.4 query label assignment

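Queries that fall inside gt_boxes are positive.

However, camera queries do not easily fall inside gt_boxes, because of:

overlap between foreground and background

errors in the 2D instances themselves

projection error

3D round:

Assign labels with the query-in-box strategy.

2D round:

For camera queries that are still unlabeled, compute the IoU between the query's 2D bbox and the projection of each 3D bbox onto the image, and assign the query to the GT with the highest IoU.

Everything left over is negative.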
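
The two rounds can be sketched as follows. For brevity the 3D boxes here are axis-aligned (x0, y0, z0, x1, y1, z1) and the projected GT boxes are passed in precomputed; the `min_iou` threshold is a hypothetical parameter added so that some queries can remain negative, and is not a value from the paper.

```python
def point_in_box(center, box):
    x, y, z = center
    x0, y0, z0, x1, y1, z1 = box
    return x0 <= x <= x1 and y0 <= y <= y1 and z0 <= z <= z1


def iou_2d(a, b):
    """IoU of two (u0, v0, u1, v1) image boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(queries, gt_boxes_3d, gt_boxes_2d, min_iou=0.5):
    """Return one GT index per query, or -1 for negatives.

    queries: dicts with 'center' and, for camera queries, 'box2d'.
    gt_boxes_2d[i] is the image projection of gt_boxes_3d[i].
    """
    labels = []
    for query in queries:
        label = -1
        # 3D round: query-in-box.
        for gt_index, box in enumerate(gt_boxes_3d):
            if point_in_box(query["center"], box):
                label = gt_index
                break
        # 2D round: only for camera queries that are still unlabeled.
        if label == -1 and query.get("box2d") is not None and gt_boxes_2d:
            ious = [iou_2d(query["box2d"], gt) for gt in gt_boxes_2d]
            best = max(range(len(ious)), key=ious.__getitem__)
            if ious[best] >= min_iou:
                label = best
        labels.append(label)
    return labels


gt_3d = [(0.0, 0.0, 0.0, 2.0, 2.0, 2.0)]
gt_2d = [(10.0, 10.0, 20.0, 20.0)]
queries = [
    {"center": (1.0, 1.0, 1.0)},                                      # LiDAR query inside the GT
    {"center": (5.0, 5.0, 5.0), "box2d": (10.0, 10.0, 20.0, 18.0)},   # rescued in the 2D round
    {"center": (5.0, 5.0, 5.0), "box2d": (100.0, 100.0, 110.0, 110.0)},
]
print(assign_labels(queries, gt_3d, gt_2d))
```

The second query shows the purpose of the 2D round: its 3D center misses the box, but its image box overlaps the projected GT strongly enough to be matched.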

4.5 detection head and loss

Three sets of boxes are involved: LiDAR reference bounding boxes, camera reference bounding boxes, and final bounding boxes.

The regression branch takes each query’s feature as input, outputs (∆x, ∆y, ∆z, log w, log l, log h, sin ry, cos ry). (∆x, ∆y, ∆z) means the predicted offset from query’s center. (w, l, h) means the dimension of the predicted box. ry is the heading angle in the yaw direction. We use L1 loss for regression branches and focal loss for classification branches, respectively.
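
Given that parameterization, turning the eight regression outputs back into a box is straightforward. The sketch below is one plausible decoding, assuming the outputs are used exactly as named in the quote; the function name and the tuple layout of the result are illustrative.

```python
import math


def decode_box(query_center, reg):
    """Decode (dx, dy, dz, log w, log l, log h, sin ry, cos ry) into a 3D box.

    Returns (x, y, z, w, l, h, ry).
    """
    dx, dy, dz, log_w, log_l, log_h, sin_ry, cos_ry = reg
    cx, cy, cz = query_center
    return (
        cx + dx, cy + dy, cz + dz,                         # center = query center + offset
        math.exp(log_w), math.exp(log_l), math.exp(log_h),  # sizes are predicted in log space
        math.atan2(sin_ry, cos_ry),                        # yaw recovered from (sin, cos)
    )


box = decode_box(
    query_center=(1.0, 2.0, 3.0),
    reg=(0.5, -0.5, 0.0, 0.0, math.log(2.0), 0.0, 1.0, 0.0),
)
print(box)
```

Predicting log sizes keeps the decoded dimensions positive, and using atan2 on the (sin, cos) pair avoids the discontinuity of regressing the angle directly and does not require the pair to be exactly unit length.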

5. Experiments

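Experiments are run on the nuScenes and Argoverse 2 datasets.

FSF outperforms SOTA methods such as BEVFusion and DeepInteraction.

It is also compared against other fusion strategies: PointPainting, feature painting, virtual point.

Why FSF performs better:

It incorporates instance-level information.

Both modalities output a consistent representation and ultimately share one prediction head.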