【Paper Notes】Scene as Occupancy

Original link: https://arxiv.org/abs/2306.02851

1. Introduction

  Compared with the traditional 3D-box object representation, a 3D occupancy representation is geometry-aware, since 3D boxes oversimplify object shapes. In addition, existing vision-based occupancy methods rarely consider temporal information, and their single-stage designs lack a coarse-to-fine refinement process.
  This paper proposes OccNet, a multi-view image-based method that includes a cascaded voxel decoder, uses temporal information to reconstruct 3D occupancy, and can be connected to a task head to support different autonomous driving tasks. The core of OccNet is a compact and representative 3D occupancy embedding to describe 3D scenes. OccNet uses a cascade approach to decode 3D occupancy features from BEV features. The decoder uses voxel-based temporal self-attention and spatial cross-attention to gradually recover the height information.
  OccNet supports various tasks including detection, segmentation and planning. To compare various methods, this paper proposes OpenOcc, a 3D occupancy benchmark with dense high-quality annotations based on nuScenes. It annotates object motion as directed flows to support planning tasks.
  OccNet significantly improves performance on semantic scene completion and 3D object detection; for planning, OccNet significantly reduces collisions compared to planning strategies based on BEV segmentation or 3D bounding boxes.

3. Method

  OccNet is divided into two stages: occupancy reconstruction and occupancy utilization. The part connecting the two stages is a unified representation of the driving scene, called the occupancy descriptor.
  Occupancy reconstruction: The goal is to obtain the occupancy descriptor that supports downstream tasks. Building voxel features directly from images has a huge computational overhead, while building only BEV features is insufficient to perceive height. OccNet strikes a balance between these two solutions and achieves higher performance at an acceptable cost. First, multi-view image features $F_t$ are extracted and fed into the BEV encoder (same as BEVFormer) together with the BEV feature of the historical frame $B_{t-1}$ and the BEV queries $Q_t$ of the current frame, producing the current-frame BEV feature $B_t$. Then the image features, the historical-frame BEV feature, and the current-frame BEV feature are decoded into the occupancy descriptor by a cascaded voxel decoder.
  Occupancy utilization: Based on the reconstructed occupancy descriptor, 3D detection and 3D semantic scene completion can be performed. The height of the 3D occupancy grid and of the bounding boxes is then compressed to obtain a BEV segmentation map. The BEV segmentation map, together with the high-level command sampler, is fed into the motion planning head, and the ego-vehicle trajectory is obtained through the argmin and GRU modules.
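  A minimal sketch of the height-compression step, assuming the semantic occupancy grid has shape (Z, H, W) and a "free space" class whose index is a hypothetical assumption here:

```python
import torch

FREE_LABEL = 0  # assumed index of the "free space" class

def occupancy_to_bev(voxel_semantics: torch.Tensor) -> torch.Tensor:
    """Compress a semantic occupancy grid (Z, H, W) along height into a binary
    BEV map (H, W): a BEV cell is 1 if any voxel in its pillar is non-free."""
    occupied = voxel_semantics != FREE_LABEL       # (Z, H, W) bool
    return occupied.any(dim=0).to(torch.float32)   # (H, W) in {0, 1}

# usage: bev_map = occupancy_to_bev(pred_semantics)  # fed to the planner's cost function
```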

3.1 Cascaded voxel decoder

  BEV to cascaded voxels: This paper splits the reconstruction of voxel features from BEV features into $N$ steps, where each step gradually increases the voxel height and decreases the number of channels. As shown in the figure, $B_{t-1}$ and $B_t$ are lifted to $V'_{t-1,i}$ and $V'_{t,i}$ by an FFN, and the $i$-th voxel decoder layer then produces the refined $V'_{t,i}$. Each voxel decoder consists of voxel-based temporal self-attention and voxel-based spatial cross-attention, which refine $V'_{t,i}$ using $V'_{t-1,i}$ and $F_t$ respectively. Finally, the occupancy descriptor $V_t$ is obtained.
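  A minimal skeleton of this cascade; the lift FFNs and the two attention modules (described below) are placeholders, and their call signatures are illustrative assumptions rather than the paper's actual interfaces:

```python
import torch.nn as nn

class CascadeVoxelDecoder(nn.Module):
    """Skeleton of the cascaded decoder: each stage lifts the previous features
    to a taller, thinner voxel grid and refines them with voxel-based temporal
    self-attention and spatial cross-attention."""
    def __init__(self, lifts, temporal_attns, spatial_attns):
        super().__init__()
        self.lifts = nn.ModuleList(lifts)                 # FFNs lifting stage i -> i+1
        self.temporal_attns = nn.ModuleList(temporal_attns)
        self.spatial_attns = nn.ModuleList(spatial_attns)

    def forward(self, bev_t, bev_prev, img_feats):
        v_t, v_prev = bev_t, bev_prev
        for lift, t_attn, s_attn in zip(self.lifts, self.temporal_attns, self.spatial_attns):
            v_t, v_prev = lift(v_t), lift(v_prev)         # raise height, shrink channels
            v_t = t_attn(v_t, v_prev)                     # refine with the historical frame
            v_t = s_attn(v_t, img_feats)                  # refine with multi-view image features
        return v_t                                        # occupancy descriptor V_t
```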
  Voxel-based temporal self-attention: Given the voxel features of the historical frame $V'_{t-1,i}$, they are first aligned with the current-frame voxel features $V'_{t,i}$ based on the ego-vehicle pose. To reduce computation, voxel-based 3D deformable attention is designed so that each query only interacts with voxels of local interest.
  3D deformable attention: 2D deformable attention is extended to a 3D form. Given voxel features $V'_{t,i}$, a query feature $q$ and a 3D reference point $p$, the 3D deformable attention is defined as
$$\text{3D-DA}(q,p,V'_{t,i})=\sum_{m=1}^{M}W_m\sum_{k=1}^{K}A_{mk}W'_k\,V'_{t,i}(p+\Delta p_{mk})$$
where $M$ is the number of attention heads, $K$ is the number of sampling points, $W_m$ and $W'_k$ are learnable weights, $A_{mk}$ is the normalized attention weight, and $p+\Delta p_{mk}$ is the learnable 3D sampling location (features are sampled from the voxels using trilinear interpolation).
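  A single-scale sketch of this formula; tensor layouts, the softmax normalization over the $K$ points, and sharing $W'_k$ across heads are simplifying assumptions (PyTorch's 5D `grid_sample` performs the trilinear interpolation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Deformable3DAttention(nn.Module):
    """Sketch of 3D-DA(q, p, V) = sum_m W_m sum_k A_mk W'_k V(p + dp_mk)."""
    def __init__(self, dim, num_heads=8, num_points=4):
        super().__init__()
        self.m, self.k = num_heads, num_points
        self.offsets = nn.Linear(dim, num_heads * num_points * 3)   # Δp_mk
        self.attn = nn.Linear(dim, num_heads * num_points)          # A_mk (pre-softmax)
        self.value_proj = nn.Linear(dim, dim)                       # W'_k (shared here)
        self.out_proj = nn.Linear(dim, dim)                         # W_m (merged over heads)

    def forward(self, q, p, value):
        # q: (B, Nq, C); p: (B, Nq, 3) reference points in [0, 1], (x, y, z) order
        # value: (B, C, Z, H, W) voxel features
        b, nq, _ = q.shape
        offsets = self.offsets(q).view(b, nq, self.m * self.k, 3)
        attn = self.attn(q).view(b, nq, self.m, self.k).softmax(-1).view(b, nq, self.m * self.k, 1)
        loc = (p.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1        # to [-1, 1] for grid_sample
        grid = loc.view(b, nq, self.m * self.k, 1, 3)               # (B, D_out, H_out, W_out, 3)
        v = self.value_proj(value.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        sampled = F.grid_sample(v, grid, align_corners=False)       # (B, C, Nq, M*K, 1), trilinear
        sampled = sampled.squeeze(-1).permute(0, 2, 3, 1)           # (B, Nq, M*K, C)
        return self.out_proj((attn * sampled).sum(dim=2))           # (B, Nq, C)
```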
  Voxel-based spatial cross-attention: The voxel features $V'_{t,i}$ interact with the multi-scale image features $F_t$ via 2D deformable attention. The decoder of layer $i$ directly samples several 3D points for each voxel and projects them onto the image planes to sample features.
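  A rough sketch of projecting the per-voxel 3D sampling points onto one image plane; the 4x4 `lidar2img` projection matrix and the validity masking convention follow the BEVFormer-style setup and are assumptions here:

```python
import torch

def project_points(points_3d: torch.Tensor, lidar2img: torch.Tensor):
    """Project 3D sampling points (N, 3) in ego coordinates onto an image plane
    using an assumed 4x4 ego-to-image matrix; points behind the camera are masked."""
    n = points_3d.shape[0]
    homo = torch.cat([points_3d, points_3d.new_ones(n, 1)], dim=-1)  # (N, 4)
    cam = homo @ lidar2img.T                                         # (N, 4)
    eps = 1e-5
    valid = cam[:, 2] > eps                                          # in front of the camera
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=eps)                     # pixel coordinates (u, v)
    return uv, valid
```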

3.2 Occupancy utilization on multiple tasks

  Semantic scene completion: An MLP predicts a semantic label for each voxel, trained with focal loss. Additionally, a flow head estimates the flow velocity of each voxel grid, trained with an L1 loss.
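  A sketch of these two heads and their losses; the hidden sizes, the class count, treating flow as a 2D velocity, and the loss weighting are assumptions, and `sigmoid_focal_loss` stands in for whatever focal-loss variant the paper actually uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

class OccupancyHeads(nn.Module):
    """Per-voxel semantic head (focal loss) and flow head (L1 loss)."""
    def __init__(self, dim=64, num_classes=17):
        super().__init__()
        self.sem_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(True), nn.Linear(dim, num_classes))
        self.flow_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(True), nn.Linear(dim, 2))

    def forward(self, voxel_feats):                     # (B, Z, H, W, C)
        return self.sem_head(voxel_feats), self.flow_head(voxel_feats)

def occupancy_loss(sem_logits, flow_pred, sem_gt, flow_gt, fg_mask, num_classes=17, flow_weight=1.0):
    """Focal loss on one-hot semantic targets plus an L1 flow loss on foreground voxels."""
    target = F.one_hot(sem_gt, num_classes).float()
    cls_loss = sigmoid_focal_loss(sem_logits, target, reduction="mean")
    flow_loss = F.l1_loss(flow_pred[fg_mask], flow_gt[fg_mask])
    return cls_loss + flow_weight * flow_loss
```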
  3D object detection: The occupancy descriptor is compressed into BEV, and 3D bounding boxes are predicted with a query-based detection head (similar to Deformable DETR).
  BEV segmentation: Map segmentation and semantic segmentation are also predicted from the BEV features. The former uses a drivable-area head and a lane head to represent the map, while the latter uses a vehicle segmentation head and a pedestrian segmentation head.
  Motion planning: First, the semantic scene completion result and the 3D bounding boxes are converted to BEV (by compressing the height), where each BEV grid cell takes the value 0 or 1 (occupied or not). This map is fed into the cost function to compute the safety cost, comfort cost and progress cost of each sampled trajectory. All candidate trajectories are sampled from random velocities, accelerations and curvatures. Given the high-level command (go forward, turn left, turn right), the lowest-cost trajectory corresponding to that command is output. As in ST-P3, front-view features are also used to refine the trajectory with a GRU to obtain the final trajectory.
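  A minimal sketch of the cost evaluation and argmin selection on the binary BEV map; the cost weights, grid resolution and origin, and the concrete form of the comfort/progress terms are illustrative assumptions, and the ST-P3-style GRU refinement is not shown:

```python
import torch

def evaluate_and_select(trajs, bev_occ, resolution=0.5, origin=(-50.0, -50.0),
                        w_safe=10.0, w_comfort=1.0, w_progress=1.0):
    """Score sampled trajectories on a binary BEV occupancy map, return the argmin.
    trajs: (N, T, 2) candidate ego trajectories in metres (x, y)
    bev_occ: (H, W) binary occupancy map (1 = occupied)"""
    h, w = bev_occ.shape
    # map metric coordinates to grid indices
    ix = ((trajs[..., 0] - origin[0]) / resolution).long().clamp(0, w - 1)
    iy = ((trajs[..., 1] - origin[1]) / resolution).long().clamp(0, h - 1)
    safety = bev_occ[iy, ix].sum(dim=1)                           # occupied cells crossed
    accel = trajs[:, 2:] - 2 * trajs[:, 1:-1] + trajs[:, :-2]     # second difference
    comfort = accel.norm(dim=-1).mean(dim=1)
    progress = -(trajs[:, -1] - trajs[:, 0]).norm(dim=-1)         # reward distance travelled
    cost = w_safe * safety + w_comfort * comfort + w_progress * progress
    best = cost.argmin()
    return trajs[best], cost[best]

# usage: select among the candidates sampled for the given high-level command,
# then refine the winner with the GRU module as in ST-P3.
```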

4. OpenOcc: 3D occupancy benchmark

  In order to fairly compare the various solutions, this paper proposes OpenOcc, a 3D occupancy benchmark based on nuScenes.

4.1 Benchmark overview

  Generate high-quality annotated occupancy data using sparse lidar point clouds and 3D bounding boxes. The labeled categories include foreground and background, and the foreground object voxels are labeled with flow velocity.

4.2 High-quality annotation generation

  Accumulate background and foreground independently: To generate a dense representation, it is intuitive to accumulate the point clouds of keyframes and intermediate frames. However, due to moving objects, accumulation through a single coordinate transformation is problematic. Based on the 3D bounding boxes, this paper splits the LiDAR point cloud into static background and foreground objects, and accumulates dense point clouds in the world coordinate system and in the object coordinate systems respectively.
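  A sketch of the two-stream accumulation; the per-frame field names (`points`, `ego_to_world`, per-box `point_mask` / `box_to_ego` / `track_id`) are hypothetical, not the nuScenes API:

```python
import numpy as np

def accumulate_scene(frames):
    """Accumulate background points in the world frame and foreground points
    per tracked object in that object's box frame."""
    def transform(pts, T):                       # apply a 4x4 rigid transform to (N, 3) points
        return pts @ T[:3, :3].T + T[:3, 3]

    background_world, objects = [], {}           # objects: track_id -> points in box frame
    for f in frames:
        fg_mask = np.zeros(len(f["points"]), dtype=bool)
        for box in f["boxes"]:
            fg_mask |= box["point_mask"]
            ego_to_box = np.linalg.inv(box["box_to_ego"])
            pts_box = transform(f["points"][box["point_mask"]], ego_to_box)
            objects.setdefault(box["track_id"], []).append(pts_box)
        background_world.append(transform(f["points"][~fg_mask], f["ego_to_world"]))
    return np.concatenate(background_world), {k: np.concatenate(v) for k, v in objects.items()}
```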
  Label generation: The accumulated dense point cloud is first voxelized, and each voxel is then labeled according to the majority label of the points it contains. In addition, based on the bounding-box velocity annotations, voxels are also annotated with flow velocity for the motion planning task. Voxels that contain no annotated points (coming from intermediate frames) are labeled based on the surrounding voxels. Finally, refinement such as filling holes in the road is performed to improve quality. In addition, by tracing camera rays, voxels invisible to the cameras are marked, which makes the task more reasonable for camera-only input.
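  A sketch of the majority-vote labeling step; the voxel size, grid origin and the per-voxel Python loop are simplifications for illustration:

```python
import numpy as np

def majority_vote_labels(points, labels, voxel_size=0.5, ignore=-1):
    """Voxelize accumulated labeled points (N, 3) and assign each voxel
    the most frequent point label inside it."""
    idx = np.floor(points / voxel_size).astype(np.int64)          # (N, 3) voxel indices
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True) # group points by voxel
    voxel_labels = np.full(len(voxels), ignore, dtype=np.int64)
    for v in range(len(voxels)):
        voxel_labels[v] = np.bincount(labels[inverse == v]).argmax()  # majority label
    return voxels, voxel_labels
```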

5. Experiment

5.1 Main results

  Semantic scene completion : OccNet can exceed the performance of other methods.
  Occupancy for lidar segmentation : Semantic scene completion is equivalent to lidar segmentation as the voxel size approaches 0. When using only image input, OccNet can approach the performance of lidar segmentation methods.
  Occupancy for 3D detection: In the scene completion task, the positions of foreground objects are roughly regressed, which helps 3D detection. When 3D detection and scene completion are trained jointly, BEV-based, voxel-based and occupancy-based methods (OccNet) all improve in performance. However, due to the relatively large voxels, mATE and mASE increase slightly during joint training.
  Occupancy pre-training for 3D detection and BEV segmentation: OccNet achieves better detection performance than FCOS3D when fine-tuned on a small portion of the dataset after pre-training; comparing the impact of different pre-training tasks on BEV segmentation, occupancy (scene completion) pre-training achieves higher performance than 3D detection pre-training.
  Occupancy for planning: When the occupancy predictions of OccNet are converted into BEV segmentation results and used for planning, the collision rate is lower than when planning on BEV segmentation results estimated directly by ST-P3.

5.2 Discussion

  Model efficiency : Comparing BEV-based methods, voxel-based methods and occupancy-based methods (OccNet), OccNet has the best performance and moderate efficiency.
  Irregular objects: Representing irregular objects such as construction vehicles and traffic signs with 3D bounding boxes is difficult and imprecise. This paper converts 3D bounding boxes into voxels and compares their IoU on irregular objects, finding that the occupancy representation describes irregular objects more appropriately. Moreover, as the voxel size decreases and the occupancy becomes more fine-grained, the gap over 3D detection widens.
  Comparison of dense and sparse occupancy : Compared with sparse occupancy, dense occupancy can express the scene more completely and contain more information, so it performs better on downstream tasks.

Appendix

A. Evaluation metrics

  Semantic scene completion (SSC) metric: Semantic scene completion predicts the semantic label of each voxel. The evaluation metric is the mean intersection-over-union (mIoU):
$$\text{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{\text{TP}_c}{\text{TP}_c+\text{FP}_c+\text{FN}_c}$$
where $C$ is the number of categories. The class-agnostic $\text{IoU}_{geo}$ is also considered to evaluate the geometric reconstruction of the scene.
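  A sketch of both metrics on flattened voxel label arrays; how the free class is encoded and whether it is included in the class loop are assumptions:

```python
import numpy as np

def ssc_metrics(pred, gt, num_classes, free_label=0):
    """Per-class IoU over voxels, their mean (mIoU), and a class-agnostic
    geometric IoU (occupied vs. free)."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        ious.append(tp / max(tp + fp + fn, 1))
    occ_pred, occ_gt = pred != free_label, gt != free_label
    iou_geo = np.logical_and(occ_pred, occ_gt).sum() / max(np.logical_or(occ_pred, occ_gt).sum(), 1)
    return float(np.mean(ious)), float(iou_geo)
```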
  3D object detection metrics: the same as the official nuScenes metrics.
  Motion planning metrics: The L2 distance between the ground-truth trajectory and the planned trajectory measures regression accuracy; the collision rate measures safety.
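  A sketch of these two metrics; the grid resolution and origin are assumptions, and checking only the trajectory points (rather than the full ego footprint) is a simplification of the collision check:

```python
import torch

def planning_metrics(pred_traj, gt_traj, bev_occ_gt, resolution=0.5, origin=(-50.0, -50.0)):
    """Average L2 distance to the ground-truth trajectory (T, 2) and the fraction
    of timesteps whose planned position lands on an occupied ground-truth BEV cell."""
    l2 = (pred_traj - gt_traj).norm(dim=-1).mean()
    h, w = bev_occ_gt.shape
    ix = ((pred_traj[..., 0] - origin[0]) / resolution).long().clamp(0, w - 1)
    iy = ((pred_traj[..., 1] - origin[1]) / resolution).long().clamp(0, h - 1)
    collision_rate = bev_occ_gt[iy, ix].float().mean()
    return l2.item(), collision_rate.item()
```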

C. OccNet implementation details

  Feature transformation in the voxel decoder: To convert the voxel features $V'_{t,i}\in\mathbb{R}^{Z_i\times H\times W\times C_i}$ into $V'_{t,i+1}\in\mathbb{R}^{Z_{i+1}\times H\times W\times C_{i+1}}$, this paper uses an MLP to transform the feature dimensions. In spatial cross-attention, the image features are transformed to $C_i$ dimensions by an MLP.
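  A sketch of this inter-stage transform, flattening the height and channel dimensions, applying an MLP, and reshaping to the new grid; the MLP depth and activation are assumptions:

```python
import torch
import torch.nn as nn

class StageTransform(nn.Module):
    """Map (B, Z_i, H, W, C_i) voxel features to (B, Z_{i+1}, H, W, C_{i+1}) with an MLP."""
    def __init__(self, z_in, c_in, z_out, c_out):
        super().__init__()
        self.z_out, self.c_out = z_out, c_out
        self.mlp = nn.Sequential(
            nn.Linear(z_in * c_in, z_out * c_out),
            nn.ReLU(inplace=True),
            nn.Linear(z_out * c_out, z_out * c_out),
        )

    def forward(self, v):                                        # v: (B, Z_in, H, W, C_in)
        b, z, h, w, c = v.shape
        v = v.permute(0, 2, 3, 1, 4).reshape(b, h, w, z * c)     # flatten height and channels
        v = self.mlp(v)
        return v.view(b, h, w, self.z_out, self.c_out).permute(0, 3, 1, 2, 4)
```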

D. More details about OpenOcc

  Accumulation of foreground objects: Since non-keyframes have no bounding-box annotations, this paper linearly interpolates the keyframe annotations and then accumulates the foreground objects.
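  A sketch of interpolating a box between two keyframes; the field layout (center, size, yaw) and interpolating yaw along the shortest arc are assumptions:

```python
import numpy as np

def interpolate_box(box_a, box_b, t):
    """Interpolate a foreground box between two keyframes, t in [0, 1]:
    linear for the center, size kept, yaw wrapped to the shortest arc."""
    center = (1 - t) * box_a["center"] + t * box_b["center"]
    dyaw = (box_b["yaw"] - box_a["yaw"] + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    return {"center": center, "size": box_a["size"], "yaw": box_a["yaw"] + t * dyaw}
```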
  Data set generation process :

  1. Occupancy data is generated based on the annotations of foreground and background points; at this point there are still some unlabeled voxels coming from intermediate frames.
  2. Based on the generated occupancy, labels are generated for some of the unlabeled voxels (according to the paper, labels are determined from neighboring voxels);
  3. Voxels that remain unlabeled are considered noise and discarded;
  4. Post-processing, such as filling small holes, is performed to ensure the completeness of the scene.

E. More experiments

  Ablation study on the number of frames in temporal self-attention: Increasing the number of frames slightly improves performance, but the gain gradually saturates.
  Evaluation of occupancy for planning: Planning with predicted occupancy or with ground-truth occupancy both performs better than planning with the corresponding bounding boxes.
  Pre-training for planning: Experiments show that occupancy pre-training does not improve planning performance compared with detection pre-training. Therefore, it is better to use the occupancy (scene completion) results directly for planning rather than only the pre-trained features.
