Tsinghua University & Nvidia's latest | Occ3D: a general and comprehensive large-scale 3D occupancy prediction benchmark


Author | Autobot

Editor | Heart of Autonomous Driving

Autonomous driving perception requires modeling both 3D geometry and semantics. Existing methods usually focus on estimating 3D bounding boxes, which discard finer geometric details and struggle with general, out-of-vocabulary objects. To overcome these limitations, this paper introduces a novel 3D occupancy prediction task that aims to estimate the detailed occupancy and semantics of objects from multi-view images. To facilitate this task, the authors develop a label generation pipeline that produces dense, visibility-aware labels for a given scene. The pipeline includes point cloud aggregation, point labeling, and occlusion handling. The authors construct two benchmarks based on the Waymo Open Dataset and the nuScenes dataset, namely Occ3D-Waymo and Occ3D-nuScenes. Finally, the authors propose a Coarse-To-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the 3D occupancy prediction task and addresses the need for finer geometric understanding in a coarse-to-fine fashion.

Introduction

3D perception is a key component of vision-based autonomous driving systems. One of the most popular visual perception tasks is 3D object detection, which estimates the positions and sizes of objects defined in a fixed ontology from monocular or stereo camera images. Although the output is a compact 3D bounding box that downstream tasks can readily consume, its expressive power is limited, as shown in Figure 1 below:

[Figure 1: limitations of 3D bounding box representations]
  1. 3D bounding box representations eliminate the geometric details of objects, e.g., an articulated bus with two or more sections connected by a pivoting joint, or a construction vehicle with a robotic arm protruding from its body;

  2. Rarely seen objects, such as litter or tree branches on the street, are usually ignored and left unannotated in datasets, because object categories cannot be exhaustively enumerated in an ontology tree.

These constraints call for a general and coherent perceptual representation that can model the detailed geometry and semantics of objects both inside and outside the ontology tree. The authors argue that understanding the occupancy state of each voxel in 3D space is key to achieving this goal. A classic task for estimating occupancy states in mobile robotics is the Occupancy Grid Map (OGM). OGM aggregates range measurements (such as lidar scans) over a period of time and estimates the probability that each voxel is occupied within a Bayesian framework. However, this solution assumes a static environment and is not suited to visual input.

In this work, the authors define a comprehensive 3D scene understanding task called 3D occupancy prediction for vision-based autonomous driving perception. 3D occupancy prediction jointly estimates the occupancy status and semantic label of each voxel in the scene from multi-view images. The occupancy status of a voxel can be free, occupied, or unobserved. Labeling unobserved voxels explicitly is crucial for accounting for visibility and for excluding them from evaluation. Semantic labels are estimated for occupied voxels: objects with predefined categories in the dataset keep their respective category labels, whereas unclassified objects are annotated as general objects (GOs). Although GOs occur infrequently, they are crucial for autonomous driving perception due to safety concerns, since they are typically missed by 3D object detectors trained on predefined categories.

Furthermore, the authors create a label generation pipeline for the 3D occupancy prediction task to produce dense, visibility-aware ground truth for each scene. The pipeline consists of several steps, including temporal point cloud aggregation, dynamic object transformation, lidar visibility estimation, and camera visibility estimation. By utilizing ego poses and object trajectories, point cloud aggregation and dynamic object transformation densify static scenes and recover the detailed geometry of dynamic objects. In addition, the authors use a ray-casting-based approach to estimate lidar and camera visibility, since visibility masks are essential for evaluating the 3D occupancy prediction task. Based on the public Waymo Open Dataset and the nuScenes dataset, the authors generate two benchmarks for the task, Occ3D-Waymo and Occ3D-nuScenes. The task adopts a set of voxel-level semantic segmentation evaluation metrics. Finally, the authors develop a transformer-based coarse-to-fine 3D occupancy prediction model named CTF-Occ. CTF-Occ aggregates 2D image features into 3D space in an efficient coarse-to-fine manner through cross-attention operations.

In summary, the contributions of this paper are as follows:

  1. The authors propose 3D occupancy prediction, a general and comprehensive 3D perception task for vision-based autonomous driving applications. Occupancy prediction can efficiently reproduce the semantics and geometry of any scene.

  2. The authors develop a rigorous label generation pipeline for occupancy prediction, construct two challenging datasets (Occ3D Waymo and Occ3D nuScenes), and establish a benchmark and evaluation metrics to facilitate future research.

  3. The authors propose a new CTF-Occ network that achieves excellent occupancy prediction performance. For this challenging task, CTF-Occ outperforms the baseline by 3.1 mIoU on Occ3D-Waymo.

Related Work

3D Detection: The goal of 3D object detection is to estimate the position and size of objects in a pre-defined ontology. 3D object detection is usually performed on lidar point clouds. Recently, vision-based 3D object detection has received more attention due to its low cost and rich semantic content, and several lidar-camera fusion methods have been proposed in the field.

3D occupancy prediction: A task related to 3D occupancy prediction is the occupancy grid map (OGM), a classic task in mobile robotics that aims to generate a probabilistic map from sequential, noisy range measurements. Typically, the ego pose is known, so the mapping problem can be solved within a Bayesian framework. Some recent works further combine semantic segmentation with OGM for downstream tasks. Note that OGM requires measurements from ranging sensors such as lidar and radar, and also assumes the scene is static over time. The 3D occupancy prediction task proposed by the authors has neither constraint and can be applied to vision-only autonomous driving systems in dynamic scenes. A parallel work, TPVFormer, proposes a tri-perspective-view method to predict 3D occupancy. However, its output is sparse because it is supervised with sparse LiDAR points.

Semantic Scene Completion: Another related task is Semantic Scene Completion (SSC), whose goal is to estimate a dense semantic space from partial observations. SSC differs from 3D occupancy prediction in two ways:

  1. SSC focuses on inferring occluded regions given visible parts, while occupancy prediction does not estimate invisible regions;

  2. SSC is generally suitable for static scenarios, while occupancy prediction is suitable for dynamic scenarios.

3D Occupancy Prediction

Task Definition

Given a sequence of sensor inputs, the goal of 3D occupancy prediction is to estimate the state of each voxel in the 3D scene. Specifically, the input of the task is a history of T frames of N surround-view camera images $\{I_i^t\}$, where $i = 1, \dots, N$ and $t = 1, \dots, T$.

The authors also assume that the sensor intrinsics $\{K_i\}$ and extrinsics $\{[R_i \mid t_i]\}$ of each frame are known. The expected output of the task is the state of each voxel, consisting of occupancy ("occupied", "empty") and semantics (a category label or "unknown"). For example, voxels on a vehicle are labeled ("occupied", "vehicle"), and voxels in free space are labeled ("empty", "none"). Note that the 3D occupancy prediction framework also supports additional attributes as outputs, such as instance IDs and motion vectors; the authors leave these for future work.
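
To make the output format concrete, here is a minimal sketch (not from the paper) of how such a per-voxel state grid could be stored; the class indices, grid shape, and label constants are illustrative assumptions, not the benchmark's actual encoding.

```python
import numpy as np

# Hypothetical label convention (illustrative only):
# 0 .. NUM_CLASSES-1 -> known semantic classes,
# GO_LABEL           -> ("occupied", "unknown"),
# FREE_LABEL         -> ("empty", "none"),
# UNOBSERVED_LABEL   -> voxel not observed by any sensor.
NUM_CLASSES = 16
GO_LABEL = NUM_CLASSES
FREE_LABEL = NUM_CLASSES + 1
UNOBSERVED_LABEL = NUM_CLASSES + 2

# A 200 x 200 x 16 voxel grid over the scene, initialized as "not observed".
occupancy = np.full((200, 200, 16), UNOBSERVED_LABEL, dtype=np.uint8)

# Example voxel states: one occupied by a vehicle (class id 0 assumed here),
# one known to be free space.
occupancy[100, 100, 4] = 0           # ("occupied", "vehicle")
occupancy[100, 101, 4] = FREE_LABEL  # ("empty", "none")
```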

Handling General Objects

One of the main advantages of the 3D occupancy prediction task is its ability to handle GOs, i.e., unknown objects. Unlike 3D object detection, where all object categories are predefined, 3D occupancy prediction uses the occupancy grid together with semantics to handle arbitrary objects: an object's geometry is represented by the voxels it occupies, and out-of-vocabulary objects are labeled ("occupied", "unknown"). This ability to represent and detect general objects makes the task more general and better suited to autonomous driving perception.

Evaluation Metrics

mIoU: Due to the similarity between the 3D voxel-level occupancy prediction task and the 2D pixel-level semantic segmentation task, the authors use mIoU to evaluate the performance of the model:

$$\text{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c}$$

where $TP_c$, $FP_c$, and $FN_c$ denote the true positive, false positive, and false negative predictions for category $c$, respectively. Because the task is vision-centric, many ground-truth voxels are in practice invisible in the images. Therefore, the authors only compute mIoU over the regions visible in the images.
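
A minimal sketch of this masked mIoU computation, assuming integer label grids and a boolean visibility mask; the function and argument names are illustrative, not the benchmark's official evaluation code.

```python
import numpy as np

def masked_miou(pred, gt, visible_mask, num_classes):
    """mIoU restricted to voxels marked "observed" in the visibility mask.

    pred, gt: integer label grids of identical shape.
    visible_mask: boolean grid, True where a voxel is observed.
    Classes are assumed to be encoded as 0 .. num_classes-1.
    """
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c) & visible_mask)
        fp = np.sum((pred == c) & (gt != c) & visible_mask)
        fn = np.sum((pred != c) & (gt == c) & visible_mask)
        denom = tp + fp + fn
        if denom > 0:                    # skip classes absent from this sample
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```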

Occ3D Dataset

Dataset Construction Pipeline

Obtaining dense voxel-level annotations for 3D scenes manually is challenging and impractical. To address this issue, the authors propose a semi-automatic label generation pipeline that leverages existing annotated 3D perception datasets. First, points from multiple frames are aggregated sequentially. Then, the densified point cloud is voxelized. Finally, voxel types are determined based on their visibility.

Data preparation: The label generation pipeline (shown in Figure 2 below) requires a 3D dataset where each scene contains the following sensor data:

  1. (multi-view) sequence of camera images;

  2. 3D lidar point cloud sequence;

  3. 3D pose sequences from the IMU.

All camera and lidar intrinsics and extrinsics are also required for coordinate transformation and projection. In addition, the authors require human-annotated box-level semantic labels for common objects, and optionally point-level semantic labels.

[Figure 2: overview of the label generation pipeline]

Point cloud aggregation: 3D reconstruction from sparse lidar observations is a classic problem in simultaneous localization and mapping (SLAM) [10]. Given a sequence of lidar point clouds and per-frame IMU pose measurements, the authors can jointly optimize the ego poses and aggregate the point clouds into a unified world coordinate system. However, dynamic objects suffer from motion blur after temporal aggregation, so dynamic and static points are treated separately. Points on dynamic objects are transformed and aggregated based on the per-frame bounding box annotations and the ego poses across frames. Points on static objects are simply aggregated according to the ego poses.

Since labeling every frame of a sequence is time-consuming, some existing datasets are only labeled at keyframes; e.g., nuScenes is captured at 10 Hz but labeled at 2 Hz. Therefore, the authors temporally interpolate the annotated object boxes to automatically annotate the unannotated frames before performing the dynamic point aggregation described above. Points in unlabeled frames that fall outside any box are likely to belong to the static background, so the authors use K-nearest-neighbor voting to determine their semantic labels. In this way, the authors obtain densely annotated foreground dynamic object instances and background static point clouds.
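
A simplified sketch of the dynamic/static split during aggregation, under assumed data structures (the fields ego_pose, point_mask, pose, and id are illustrative, not the actual dataset API); temporal box interpolation and the K-nearest-neighbor label voting are omitted for brevity.

```python
import numpy as np

def aggregate_frame(points, labels, ego_pose, boxes):
    """Split one lidar frame into static world points and per-object points.

    points: (N, 3) points in the ego frame; labels: (N,) semantic labels.
    ego_pose: 4x4 ego-to-world transform for this frame.
    boxes: list of dicts with "id", a 4x4 object-to-world "pose", and a
           boolean "point_mask" marking points inside the box.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    world_pts = (ego_pose @ homo.T).T[:, :3]

    in_any_box = np.zeros(len(points), dtype=bool)
    per_object = {}
    for box in boxes:
        mask = box["point_mask"]
        in_any_box |= mask
        # Move the object's points into its canonical (box) frame; stacking
        # frames there avoids motion blur for dynamic objects.
        obj_homo = np.concatenate(
            [world_pts[mask], np.ones((mask.sum(), 1))], axis=1)
        obj_pts = (np.linalg.inv(box["pose"]) @ obj_homo.T).T[:, :3]
        per_object.setdefault(box["id"], []).append(obj_pts)

    # Remaining points are treated as static background, kept in world frame.
    return world_pts[~in_any_box], labels[~in_any_box], per_object
```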

LiDAR visibility: To obtain a dense and regular 3D occupancy grid from the aggregated lidar point cloud, a straightforward approach is to mark voxels containing points as "occupied" and the rest as "empty". However, since lidar points are sparse, some occupied voxels are never hit by a lidar beam and would be incorrectly labeled "empty". To avoid this, the authors perform a ray-casting operation to determine the visibility of each voxel. Specifically, each lidar point is connected to the sensor origin to form a ray; a voxel is visible if it reflects a lidar point ("occupied") or is traversed by a ray ("empty"); otherwise, it is marked as "not observed". In this way, the authors generate a voxel-level lidar visibility mask.
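
A rough sketch of the ray-casting idea behind the lidar visibility mask; it uses fixed-step ray marching instead of an exact voxel traversal (e.g. Amanatides-Woo), and all names and parameters are assumptions rather than the authors' implementation.

```python
import numpy as np

UNOBSERVED, FREE, OCCUPIED = 0, 1, 2

def lidar_visibility(points, origin, grid_min, voxel_size, grid_shape, step=0.2):
    """Approximate per-voxel lidar visibility by marching sensor-to-point rays."""
    vis = np.full(grid_shape, UNOBSERVED, dtype=np.uint8)

    def to_index(p):
        idx = np.floor((p - grid_min) / voxel_size).astype(int)
        return idx if np.all((idx >= 0) & (idx < grid_shape)) else None

    for p in points:
        direction = p - origin
        length = np.linalg.norm(direction)
        direction = direction / max(length, 1e-6)
        # Voxels traversed by the ray are observed free space...
        for d in np.arange(0.0, length, step):
            idx = to_index(origin + d * direction)
            if idx is not None and vis[tuple(idx)] != OCCUPIED:
                vis[tuple(idx)] = FREE
        # ...and the voxel containing the lidar return is occupied.
        end_idx = to_index(p)
        if end_idx is not None:
            vis[tuple(end_idx)] = OCCUPIED
    return vis
```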

Occlusion inference and camera visibility: Since the focus is on vision-centric tasks, the authors further propose an occlusion inference algorithm that generates a camera visibility mask indicating whether each voxel is observed in the current multi-camera views. Specifically, for each camera view, each occupied voxel center is connected to the camera center to form a ray. Along each ray, the voxels up to and including the first occupied voxel are set to "observed" and the rest to "not observed". Voxels not traversed by any camera ray are also marked as "not observed". As shown in Figure 3 below, white voxels are observed in the accumulated lidar view but not in the current camera view.

[Figure 3: voxels observed in the accumulated lidar view but not in the current camera view]

Note that lidar visibility masks and camera visibility masks can differ for two reasons:

(1) The lidar and the cameras are mounted at different positions;

(2) Lidar visibility is consistent across the whole sequence, since it is computed from the aggregated point cloud, while camera visibility varies at each timestamp.

Determining voxel visibility is important for evaluating the 3D occupancy prediction task: evaluation is only performed on voxels that are "observed" in both the lidar and camera views.
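
Analogous to the lidar case, a simplified sketch of the camera-visibility computation described above (assumed names, fixed-step marching instead of an exact traversal), highlighting the occlusion rule that a camera ray stops contributing after its first occupied voxel.

```python
import numpy as np

OBSERVED, NOT_OBSERVED = 1, 0

def camera_visibility(occupied_idx, cam_center, grid_min, voxel_size,
                      grid_shape, step=0.2):
    """Camera visibility mask by ray casting to occupied voxel centers.

    occupied_idx: (M, 3) integer indices of occupied voxels.
    cam_center: 3D camera center in the same world/grid frame.
    """
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(occupied_idx.T)] = True
    vis = np.full(grid_shape, NOT_OBSERVED, dtype=np.uint8)

    centers = grid_min + (occupied_idx + 0.5) * voxel_size
    for target in centers:
        direction = target - cam_center
        length = np.linalg.norm(direction)
        direction = direction / max(length, 1e-6)
        for d in np.arange(0.0, length + step, step):
            idx = np.floor((cam_center + d * direction - grid_min)
                           / voxel_size).astype(int)
            if np.any(idx < 0) or np.any(idx >= grid_shape):
                continue
            vis[tuple(idx)] = OBSERVED
            if occ[tuple(idx)]:
                break  # the first occupied voxel occludes everything behind it
    return vis
```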

Dataset Statistics

Based on the semi-automatic labeling pipeline above, the authors generate two 3D occupancy prediction datasets, Occ3D-Waymo and Occ3D-nuScenes. Occ3D-Waymo contains 798 sequences for training and 202 sequences for validation, with 14 known object classes and an additional GO class. Occ3D-nuScenes contains 700 scenes for training and 150 scenes for validation, with 16 known object classes and an additional GO class. Table 1 below compares the proposed Occ3D datasets with existing datasets in various aspects.

[Table 1: comparison of the Occ3D datasets with existing datasets]

Coarse-to-Fine Occupancy Model

To address the challenging 3D occupancy prediction problem, the authors propose a new transformer-based model called Coarse-to-Fine Occupancy (CTF-Occ) network. The authors describe the model design in detail in this section.

Overall Architecture

Figure 4 below shows the CTF-Occ network architecture diagram.

[Figure 4: CTF-Occ network architecture]

First, an image backbone network extracts 2D image features from the multi-view images. Then, 3D voxel queries aggregate the 2D image features into 3D space via cross-attention. The approach uses a pyramidal voxel encoder that progressively refines the voxel-wise feature representations in a coarse-to-fine fashion through incremental token selection and spatial cross-attention. This improves the spatial resolution and refines the detailed geometry of objects, ultimately enabling more accurate 3D occupancy prediction. Furthermore, the authors use an implicit occupancy decoder, which allows output at arbitrary resolution.

Coarse-to-Fine Voxel Encoder

Compared to 3D object detection, the 3D occupancy prediction task involves modeling more complex object geometries. For this reason, the authors' method preserves the full 3D voxel space without compressing the height dimension. Initially, learnable voxel embeddings of shape H×W×L aggregate multi-view image features into the 3D grid space. Then, multiple CTF voxel encoders are stacked to achieve multi-scale interaction. The voxel encoder at each pyramid level consists of three components: an incremental token selection module, a voxel spatial cross-attention module, and a convolutional feature extractor.

Incremental token selection: As mentioned earlier, predicting 3D occupancy requires a detailed geometric representation, but using all 3D voxel tokens to interact with regions of interest in the multi-view images would incur significant computation and memory costs. Considering that most 3D voxels in a scene are empty, the authors propose an incremental token selection strategy that selects only foreground and uncertain voxel tokens for the cross-attention computation. This strategy enables fast and efficient computation without sacrificing accuracy. Specifically, at the beginning of each pyramid level, each voxel token is fed into a binary classifier that predicts whether the voxel is empty; the classifier is trained with binary ground-truth occupancy maps as supervision. K voxel tokens are then selected for subsequent feature refinement, and there are three ways to define them: the voxels with predicted probabilities closest to 0.5 (most uncertain), the top-K voxels with the highest non-empty scores, or a combination of the two in a specified proportion. Ablation studies show that selecting foreground voxels at an early stage is the more desirable choice.
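
A small sketch of the three token-selection variants discussed above (plus the random baseline used in the ablation); the binary head, the top-k ratio, and the tensor layout are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def select_tokens(voxel_feats, binary_head, k_ratio=0.2, mode="topk"):
    """Select K voxel tokens for spatial cross-attention (sketch).

    voxel_feats: (N, C) flattened voxel tokens at the current pyramid level.
    binary_head: module predicting one non-empty logit per token.
    Returns the indices of the K tokens kept for cross-attention.
    """
    logits = binary_head(voxel_feats).squeeze(-1)   # (N,)
    probs = torch.sigmoid(logits)
    k = max(1, int(k_ratio * voxel_feats.shape[0]))
    if mode == "topk":            # most likely non-empty (foreground) tokens
        scores = probs
    elif mode == "uncertain":     # probabilities closest to 0.5
        scores = -(probs - 0.5).abs()
    else:                         # random baseline from the ablation study
        scores = torch.rand_like(probs)
    return torch.topk(scores, k).indices
```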

Spatial cross-attention: At each pyramid level, the top-K voxel tokens are first selected and the corresponding image features are then aggregated. In particular, the authors apply spatial cross-attention to further refine the voxel-wise features. The 3D spatial cross-attention is defined as:

$$\text{SCA}(q, F) = \frac{1}{|\mathcal{V}|}\sum_{i \in \mathcal{V}}\sum_{j=1}^{N_{\text{ref}}} \text{DeformAttn}\big(q, \mathcal{P}(p, i, j), F_i\big)$$

where $i$ and $j$ index the camera views and reference points, respectively, and $\mathcal{V}$ is the set of camera views onto which the voxel projects. For each selected voxel token query located at $p = (x, y, z)$, the projection $\mathcal{P}(p, i, j)$ yields the $j$-th reference point on the $i$-th image, and $F_i$ denotes the features of the $i$-th camera view. The real-world position $(x', y', z')$ of the reference point corresponding to the query at $p = (x, y, z)$ is computed as:

[Equation: mapping from the voxel index p = (x, y, z) to its real-world position (x′, y′, z′), based on the grid shape (H, W, L) and the voxel size s; shown as an image in the original post]

where H, W, and L are the 3D grid dimensions at the current pyramid level, and s is the voxel size.
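
A sketch of how selected voxel tokens could be turned into image reference points for the cross-attention, assuming a voxel-center convention and a combined lidar-to-image projection matrix; both are assumptions, not the paper's exact formulation.

```python
import torch

def voxel_to_reference_points(voxel_idx, voxel_size, grid_origin):
    """Voxel indices -> real-world center coordinates (sketch).

    voxel_idx: (K, 3) integer indices (x, y, z) of the selected tokens.
    grid_origin: world coordinates of the grid's minimum corner; using the
    voxel center (the +0.5 offset) is an assumed convention.
    """
    return (voxel_idx.float() + 0.5) * voxel_size + grid_origin

def project_to_image(ref_points, lidar2img):
    """Project 3D reference points into one camera view (sketch).

    lidar2img: 4x4 matrix combining the camera extrinsics and intrinsics.
    Returns pixel (u, v) coordinates and a depth-validity mask.
    """
    homo = torch.cat([ref_points, torch.ones_like(ref_points[:, :1])], dim=-1)
    cam = homo @ lidar2img.T                  # (K, 4) camera-space points
    depth = cam[:, 2:3].clamp(min=1e-5)
    uv = cam[:, :2] / depth                   # perspective division
    valid = cam[:, 2] > 1e-5                  # points in front of the camera
    return uv, valid
```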

Convolutional feature extractor: Once deformable cross-attention has been applied to the relevant image features, the features of the foreground voxel tokens are updated. Then, a series of stacked convolutions enhances feature interaction across the 3D voxel-wise feature map. At the end of the current level, the 3D voxel features are upsampled using trilinear interpolation. The whole process can be described as:

[Equation: the per-level voxel feature update, i.e., spatial cross-attention followed by stacked convolutions and trilinear upsampling; shown as an image in the original post]
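
A minimal PyTorch sketch of such a convolutional refinement stage followed by trilinear upsampling along the z-axis; the channel width and number of convolutions are assumed values, since the paper only states that stacked convolutions refine the features before upsampling.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvFeatureExtractor(nn.Module):
    """Stacked 3D convolutions plus trilinear upsampling (sketch)."""

    def __init__(self, channels=256, num_convs=2):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv3d(channels, channels, 3, padding=1),
                          nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
            for _ in range(num_convs)
        ])

    def forward(self, voxel_feats):
        # voxel_feats: (B, C, H, W, Z) at the current pyramid resolution.
        x = self.convs(voxel_feats)
        # Double the z-axis resolution for the next (finer) pyramid level,
        # matching the reported 8 -> 16 -> 32 progression.
        return F.interpolate(x, scale_factor=(1, 1, 2), mode="trilinear",
                             align_corners=False)
```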

Implicit Occupancy Decoder

The CTF voxel encoder generates voxelized feature outputs, which are then fed into multiple MLPs to obtain the final occupancy prediction of shape H×W×L×C′, where C′ is the number of semantic classes. Furthermore, the authors introduce an implicit occupancy decoder that can produce output at arbitrary resolution by exploiting implicit neural representations. The implicit decoder is implemented as an MLP that outputs a semantic label from two inputs: the voxel feature vector extracted by the voxel encoder and a 3D coordinate inside the voxel. The process can be described as:

[Equation: the implicit decoder, an MLP applied to the voxel feature vector together with the 3D coordinate inside the voxel; shown as an image in the original post]
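
A possible implementation sketch of the implicit decoder as described; the hidden width and depth are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitOccupancyDecoder(nn.Module):
    """MLP over (voxel feature, intra-voxel coordinate) pairs (sketch).

    Concatenating a 3D offset inside the voxel with the voxel feature lets
    the decoder be queried at arbitrary resolution.
    """

    def __init__(self, feat_dim=256, num_classes=16, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, voxel_feat, coords):
        # voxel_feat: (N, feat_dim); coords: (N, 3) points inside the voxels.
        return self.mlp(torch.cat([voxel_feat, coords], dim=-1))
```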

Loss Function

To optimize the occupancy prediction, the authors use the OHEM [30] loss for model training: $L_{occ} = \sum_k w_k \cdot L(\hat{y}_k, y_k)$, where $w_k$, $y_k$, and $\hat{y}_k$ denote the loss weight, ground-truth labels, and predictions of the $k$-th class, respectively. Furthermore, the authors use binary voxel masks to supervise the binary classification head at each pyramid level. The binary voxel mask $M_i$ is generated by downsampling the ground-truth occupancy labels to the spatial resolution $s_i$ of level $i$, and the output of the binary classification head at level $i$ is denoted $p_i$. The binary classification loss is defined as $L_{bin} = \sum_i L_{BCE}(p_i, M_i)$, where $i$ indexes the pyramid levels. Finally, the total loss is $L = L_{occ} + L_{bin}$.
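
A sketch of the overall loss computation; note that the OHEM term below uses a generic hardest-voxel mining rule rather than the paper's exact per-class weighting, and the keep ratio and ignore index are assumptions.

```python
import torch
import torch.nn.functional as F

def ohem_ce_loss(logits, target, keep_ratio=0.7, ignore_index=255):
    """Online hard example mining on voxel-wise cross-entropy (sketch).

    logits: (N, C) per-voxel class scores; target: (N,) labels.
    """
    loss = F.cross_entropy(logits, target, reduction="none",
                           ignore_index=ignore_index)
    num_kept = max(1, int(keep_ratio * loss.numel()))
    hard, _ = torch.topk(loss, num_kept)   # keep only the hardest voxels
    return hard.mean()

def total_loss(occ_logits, occ_target, binary_logits, binary_targets):
    """Occupancy loss plus per-pyramid-level binary occupancy supervision."""
    loss = ohem_ce_loss(occ_logits, occ_target)
    for p_i, m_i in zip(binary_logits, binary_targets):
        loss = loss + F.binary_cross_entropy_with_logits(p_i, m_i.float())
    return loss
```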

Experiments

Experimental Settings

Dataset: Occ3D-Waymo contains a total of 1,000 publicly available sequences, of which 798 are used for training and 202 for validation. The scene range is set from -40 m to 40 m along the X and Y axes, and from -5 m to 7.8 m along the Z axis. Occ3D-nuScenes contains 700 training scenes and 150 validation scenes. Its occupancy range is -40 m to 40 m along the X and Y axes, and -1 m to 5.4 m along the Z axis. The authors use a voxel size of 0.4 m for the experiments on both datasets.

Architecture: The authors use a ResNet-101 [13] backbone initialized from an FCOS3D [36] checkpoint. Images are resized to 640×960 for Occ3D-Waymo and to 928×1600 for Occ3D-nuScenes. Apart from the z-axis resolution, the same CTF-Occ network settings are used for both datasets. The voxel embedding has shape 200×200 with 256 channels. The voxel embeddings first pass through four encoder layers without token selection. For Occ3D-Waymo there are three pyramid levels with z-axis resolutions of 8, 16, and 32, respectively; for Occ3D-nuScenes there are two pyramid levels with z-axis resolutions of 8 and 16. Each level contains an SCA layer, and the top-k ratio of the incremental token selection strategy is set to 0.2 for all pyramid levels.
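
For reference, the per-dataset settings above could be collected into a configuration like the following sketch; the keys and structure are illustrative, not the authors' actual configuration files.

```python
# Hypothetical configuration summarizing the reported CTF-Occ settings.
CTF_OCC_CONFIGS = {
    "occ3d_waymo": {
        "image_backbone": "ResNet-101 (FCOS3D-initialized)",
        "image_size": (640, 960),
        "voxel_embedding": {"shape": (200, 200), "channels": 256},
        "encoder_layers_without_token_selection": 4,
        "pyramid_z_resolutions": (8, 16, 32),
        "top_k_ratio": 0.2,
    },
    "occ3d_nuscenes": {
        "image_backbone": "ResNet-101 (FCOS3D-initialized)",
        "image_size": (928, 1600),
        "voxel_embedding": {"shape": (200, 200), "channels": 256},
        "encoder_layers_without_token_selection": 4,
        "pyramid_z_resolutions": (8, 16),
        "top_k_ratio": 0.2,
    },
}
```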

The authors also extend two mainstream BEV models, BEVDet [14] and BEVFormer [18], to the 3D occupancy prediction task. Their original detection decoders are replaced with the occupancy decoder used in the CTF-Occ network, while their BEV feature encoders are kept. Following their original settings, a ResNet101-DCN backbone initialized from an FCOS3D [36] checkpoint is used as the image backbone.

Implementation details: The authors use the AdamW optimizer [23] and a cosine learning rate scheduler with a learning rate of 2e-4. Unless otherwise specified, all models are trained for 24 epochs for the main comparisons and 8 epochs for the ablation studies.

Comparison with Previous Methods

Occ3D-nuScenes: Table 2 below shows the 3D occupancy prediction performance of the proposed method compared with related methods on the Occ3D-nuScenes dataset. Under the IoU metric, the method outperforms the previous baselines on all classes. These observations are consistent with those on the Occ3D-Waymo dataset.

[Table 2: 3D occupancy prediction results on Occ3D-nuScenes]

Occ3D-Waymo: The authors compare the CTF-Occ network with state-of-the-art models on the newly proposed Occ3D-Waymo dataset; the results are shown in Table 4 below. The method has a significant advantage over previous methods, improving mIoU by 3.11. Notably, for small objects such as pedestrians and bicycles, it outperforms the baseline by 4.11 and 13.0 IoU, respectively. This is because features are captured in 3D voxel space without compressing the height dimension, which preserves the detailed geometry of objects. The results demonstrate the effectiveness of the coarse-to-fine voxel encoder.

[Table 4: 3D occupancy prediction results on Occ3D-Waymo]

Ablation Study

In this section, the authors ablate the incremental token selection strategy and the OHEM loss. The results are shown in Table 3 below, where CC stands for traffic cone and PED stands for pedestrian; these two classes are highlighted to verify the gains on small objects. Both techniques improve performance, and using the OHEM loss together with top-k token selection yields the best results. Without the OHEM loss, the model reaches only 10.06 mIoU. Combining the OHEM loss with a random token selection strategy achieves 14.75 mIoU, while the uncertainty-based token selection strategy with the OHEM loss achieves 17.37 mIoU. For token selection, uncertainty-based selection is comparable to top-k selection, and both are significantly better than random selection.

[Table 3: ablation of token selection strategies and the OHEM loss]

Qualitative Results

Figure 5 compares the CTF-Occ network output with the state-of-the-art method BEVFormer-Occ on the Occ3D-Waymo dataset. The CTF-Occ network produces more detailed voxel geometry than BEVFormer-Occ. Furthermore, the voxel decoder can produce output at arbitrary resolution, independent of the resolution of the ground-truth data.

[Figure 5: qualitative comparison between CTF-Occ and BEVFormer-Occ on Occ3D-Waymo]

Conclusion

The authors present Occ3D, a large-scale 3D occupancy prediction benchmark for visual perception. The benchmark includes a data generation protocol, two datasets, and the CTF-Occ network for the task, all of which will be open-sourced to facilitate future research. The study shows that semantic occupancy provides a more expressive and richer representation of objects. Furthermore, it offers a unified representation of known and unknown objects, which is crucial for outdoor autonomous driving perception. Beyond direct use, the benchmark opens several avenues for future research; for example, adding instance IDs to semantic voxels would essentially turn the task into panoptic segmentation and provide richer information.

References

[1] Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

Open source address: https://tsinghua-mars-lab.github.io/Occ3D/
