Tsinghua & Tianjin University's new work | SurroundOcc: Pure Vision 3D Semantic Occupancy Prediction for Autonomous Driving Scenes (Open Source)

This article was first published on the WeChat public account CVHub and may not be reproduced on other platforms in any form. It is for learning and communication only; violators will be held accountable!

Title: SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Paper: https://arxiv.org/pdf/2303.09551.pdf

Code: https://github.com/weiyithu/SurroundOcc

Introduction

Most traditional 3D scene understanding methods focus on 3D object detection, which struggles to describe real-world objects of arbitrary shapes and unlimited categories. SurroundOcc, the method proposed in this paper, can perceive 3D scenes more comprehensively. Multi-scale features are first extracted for each image, and 2D-3D spatial attention lifts them into 3D volumetric space. Then, the volumetric features are progressively upsampled by 3D convolutions and supervised at multiple levels.

In addition, in order to obtain dense occupancy predictions, this paper designs a pipeline to generate dense occupancy labels, which saves a great deal of manpower and time. Specifically, the paper fuses multi-frame LiDAR scans, then employs Poisson reconstruction to fill holes, and voxelizes the resulting mesh to obtain dense occupancy labels. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate the effectiveness of the method.

::: block-1

Occupancy networks

The occupancy network is a new type of perception network introduced at CVPR 2022 by Ashok Elluswamy, head of Tesla's FSD team. It draws on the ideas of occupancy grid mapping and performs online 3D reconstruction of the perceived environment in a simple form.

In simple terms, the world is divided into a series of grid cells, and each cell is marked as occupied or free. By predicting an occupancy probability for every cell, a simple representation of 3D space is obtained (see the sketch after this note).
:::
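
To make the voxel-grid idea concrete, here is a tiny numpy sketch (purely illustrative, not taken from the paper) that marks which cells of a 3D grid are occupied by an example point cloud; the grid extent and voxel size are arbitrary choices:

```python
import numpy as np

# Arbitrary grid covering 20 m x 20 m x 6 m around the ego vehicle.
voxel_size = 0.5
grid_min = np.array([-10.0, -10.0, -2.0])
grid_max = np.array([10.0, 10.0, 4.0])
grid_shape = np.ceil((grid_max - grid_min) / voxel_size).astype(int)  # (40, 40, 12)

# Example point cloud; in practice these would be LiDAR points or predictions.
points = np.random.uniform(grid_min, grid_max, size=(1000, 3))

# Convert each point to an integer voxel index and flag that cell as occupied.
idx = np.clip(((points - grid_min) / voxel_size).astype(int), 0, grid_shape - 1)
occupancy = np.zeros(grid_shape, dtype=bool)
occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = True

print(f"{occupancy.sum()} of {occupancy.size} voxels occupied")
```

A semantic occupancy network predicts, for every such cell, a probability of being occupied (and, in the semantic case, a class label), rather than filling it from observed points as above.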

Background

Figure 1. SurroundOcc overview

Understanding the 3D geometry of the surrounding environment is the first step in building an autonomous driving system. Although LiDAR is a direct and effective way to capture geometric information, its high cost and the sparsity of its scans limit further application. In recent years, vision-only autonomous driving approaches that take multi-camera images as input have shown competitive performance in various 3D perception tasks, including depth estimation, 3D object detection, and semantic map construction, and have attracted extensive attention from industry.

Multi-camera 3D object detection has long played an important role in 3D perception, but it struggles with the long-tail problem: it is difficult to recognize all categories of objects in the real world. Complementary to 3D object detection, reconstructing the surrounding 3D scene can better support downstream perception tasks. Many recent works predict surrounding depth maps by combining information from multiple viewpoints. However, a depth map only predicts the nearest occupied point along each ray and cannot recover the occluded parts of the 3D scene.

Different from depth-based methods, another approach is to directly predict the 3D occupancy of the scene, i.e., describe the 3D scene by assigning an occupancy probability to each voxel in 3D space. However, current works of this type either simply fuse per-camera results in post-processing or cannot generate dense occupancy predictions under sparse point cloud supervision, which limits their performance.

::: block-1

Voxel-based scene representation

MonoScene, TPVFormer

Discretizing 3D space into voxels, and describing each voxel with vector features, is widely used in tasks such as lidar segmentation and 3D scene completion. For 3D occupancy prediction tasks, voxel representations are more suitable for simulating the occupancy of 3D scenes.

3D scene reconstruction

SurfaceNet, Atlas, NeuralRecon

Depth estimation predicts a depth value for each pixel. Early depth estimation methods require full depth annotations to supervise the models, while later studies pay more attention to self-supervised depth estimation.

Vision-Based 3D Perception

Explicit Depth Method, Implicit Depth Method

Due to the lack of direct geometric input, vision-based 3D perception needs to exploit semantic cues to infer 3D scene geometry. Explicit depth methods explicitly predict the depth map of the image input to extract the 3D geometric information of the scene, and then project it into 3D space. Implicit depth methods implicitly learn 3D features without producing explicit depth maps.
:::

To address the above issues, this paper proposes the SurroundOcc method, which aims to predict dense and accurate 3D occupancy using multi-camera image inputs. First, a 2D neural network is used to extract multi-scale feature maps from each image. Then, a 2D-3D spatial attention mechanism is used to lift the multi-camera image information to 3D volumetric features instead of BEV features. Next, a 3D convolutional network is employed to gradually upsample low-resolution volumetric features and fuse them with high-resolution features to obtain fine-grained 3D representations. At each level, the network is supervised using a decaying weighted loss.

In addition, to avoid expensive occupancy annotations, this paper also proposes a pipeline that generates dense occupancy labels from existing 3D object detection and 3D semantic segmentation labels. Specifically, multi-frame point clouds of dynamic objects and the static scene are first merged separately. Then the Poisson reconstruction algorithm is used to fill holes. Finally, nearest-neighbor (NN) search and voxelization are used to obtain dense 3D occupancy labels.

Method

Overview

Figure 2. The pipeline of the method in this paper

As shown in Figure 2, the method consists of a multi-stage pipeline. First, multi-scale features are extracted from the multiple camera images at several levels using a 2D backbone such as ResNet-101. Then, at each level, a transformer with 2D-3D spatial cross-attention fuses the multi-camera features to improve accuracy. The output of the 2D-3D spatial attention layers is fed into a 3D convolutional network. Finally, the multi-scale volumetric features are upsampled and combined by the 3D convolutional network, and the occupancy prediction at each level is supervised with decreasing loss weights.

2D-3D Spatial Attention

Figure 3. Comparison of 3D-based and BEV-based cross-view attention

Traditional methods usually integrate 2D features from multiple views into 3D space under the assumption that different views contribute equally, which does not always hold in practice, especially when some views are blocked or occluded.

To address this problem, this paper utilizes a cross-view attention mechanism to fuse features from multiple cameras: 3D reference points are projected into the 2D views, and a deformable attention mechanism queries and aggregates information at these points. Unlike traditional methods, this approach builds 3D volumetric queries, which further preserve 3D spatial information. By projecting these query points, 2D features can be sampled in the corresponding views and aggregated with learned weights through deformable attention. Finally, 3D convolutions let adjacent 3D voxel features interact, which improves the accuracy of 3D scene reconstruction.
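
To make the lifting step concrete, the sketch below projects 3D voxel reference points into one camera and bilinearly samples its 2D feature map at the projected locations (plain sampling stands in for the learned deformable attention; the intrinsics, shapes, and function name are assumptions for illustration, not the released code):

```python
import torch
import torch.nn.functional as F

def sample_one_view(feat_2d, pts_cam, intrinsics, img_hw):
    """feat_2d: (C, Hf, Wf) feature map of one camera.
    pts_cam: (N, 3) 3D reference points already in this camera's frame.
    intrinsics: (3, 3) camera matrix. img_hw: (H, W) of the original image."""
    # Project reference points onto the image plane.
    uvz = pts_cam @ intrinsics.T
    z = uvz[:, 2].clamp(min=1e-5)
    uv = uvz[:, :2] / z.unsqueeze(1)

    # Points behind the camera or outside the image do not contribute.
    valid = (uvz[:, 2] > 0.1) & (uv[:, 0] >= 0) & (uv[:, 0] < img_hw[1]) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < img_hw[0])

    # Normalise pixel coordinates to [-1, 1] and sample the feature map.
    grid = torch.stack([uv[:, 0] / img_hw[1] * 2 - 1,
                        uv[:, 1] / img_hw[0] * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat_2d[None], grid[None, :, None, :],
                            mode="bilinear", align_corners=False)   # (1, C, N, 1)
    return sampled[0, :, :, 0].T * valid[:, None]                   # (N, C)
```

In the full model, each 3D voxel query gathers such samples from every camera that sees it (with learned offsets and attention weights), so occluded or uninformative views are naturally down-weighted when the per-view contributions are fused.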

Multiscale Occupancy Prediction

The performance of 3D scene reconstruction is improved by extending the 2D-3D spatial attention mechanism to multiple scales. Specifically, this paper adopts a 2D-3D U-Net architecture: multi-scale 2D features are fed into different numbers of 2D-3D spatial attention layers to extract multi-scale 3D volumetric features. Then, the 3D volumetric features of the previous level are upsampled through a 3D deconvolution layer and fused with the features of the current scale to generate the 3D volumetric features of that scale. The network outputs an occupancy prediction at each scale. In order to obtain rich multi-level 3D features, the network is supervised at every scale.
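
As an illustration of the coarse-to-fine fusion at one level, here is a minimal PyTorch sketch with made-up channel counts and voxel resolutions (not the authors' implementation):

```python
import torch
import torch.nn as nn

C, num_classes = 64, 17  # assumed feature width and number of semantic classes
coarse = torch.randn(1, C, 25, 25, 2)   # volume features from the previous (coarser) level
fine = torch.randn(1, C, 50, 50, 4)     # volume features of this level from 2D-3D attention

upsample = nn.ConvTranspose3d(C, C, kernel_size=2, stride=2)   # 3D deconvolution
fuse = nn.Conv3d(2 * C, C, kernel_size=3, padding=1)
head = nn.Conv3d(C, num_classes, kernel_size=1)                # per-voxel occupancy head

# Upsample the coarser volume, fuse it with the current scale, and predict occupancy.
fused = fuse(torch.cat([upsample(coarse), fine], dim=1))
occ_logits = head(fused)
print(occ_logits.shape)  # torch.Size([1, 17, 50, 50, 4])
```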

Furthermore, this paper uses a cross-entropy loss and a scene-class affinity loss as supervision signals at the different scales. For 3D semantic occupancy prediction, the authors employ a multi-class cross-entropy loss, which is changed to a binary form for 3D scene reconstruction. In order to emphasize the high-resolution predictions, a decayed loss weight α_j = 1/2^j is applied to the supervision at scale j.
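
A minimal sketch of this multi-scale supervision (cross-entropy only; the scene-class affinity loss is omitted and the shapes are placeholders):

```python
import torch
import torch.nn.functional as F

num_classes = 17
# Hypothetical predictions and labels at three scales, finest first (j = 0, 1, 2).
preds = [torch.randn(1, num_classes, 200 // 2**j, 200 // 2**j, 16 // 2**j) for j in range(3)]
labels = [torch.randint(0, num_classes, (1, 200 // 2**j, 200 // 2**j, 16 // 2**j)) for j in range(3)]

total_loss = 0.0
for j, (pred, label) in enumerate(zip(preds, labels)):
    alpha_j = 1.0 / 2 ** j          # decayed weight: the finest scale counts the most
    total_loss = total_loss + alpha_j * F.cross_entropy(pred, label)
print(total_loss)
```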

Dense Occupancy Labels

Figure 4. Dense occupancy label generation

A network supervised with sparse LiDAR point clouds cannot predict sufficiently dense occupancy, so dense occupancy labels need to be generated. However, manually labeling the dense occupancy of millions of voxels is a complex task that requires a lot of human effort.

To address this issue, this paper proposes a pipeline that generates dense occupancy labels from existing 3D detection and semantic segmentation labels instead of human annotations. Specifically, multi-frame LiDAR point clouds of dynamic objects and the static scene are stitched separately, the Poisson reconstruction algorithm is used to fill holes, and the resulting mesh is voxelized to obtain dense occupancy.

Multi-frame point cloud stitching

This paper proposes a two-stream pipeline that stitches the static scene and movable objects separately, and then merges them into a complete scene before voxelization.

Specifically, for each frame, the movable objects in the LiDAR point cloud are first cut out according to the 3D bounding box labels, yielding separate 3D point clouds for the static scene and the movable objects. Then, after traversing all frames in the sequence, the collected static-scene segments and object segments are each aggregated into a set. To merge multi-frame segments, their coordinates are transformed into world coordinates using the known calibration matrices and ego poses. Finally, according to the object positions and ego pose of the current frame, the 3D point cloud of the current frame is obtained by merging the static-scene point cloud with the object point clouds. In this way, the occupancy label of the current frame makes use of the LiDAR information from all frames in the sequence.
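
A simplified sketch of this two-stream stitching is shown below; the frame dictionary, the `box.contains` / `box.obj2ego` interface, and the coordinate conventions are hypothetical stand-ins for the nuScenes annotations actually used:

```python
import numpy as np

def apply_transform(points, T):
    """Apply a 4x4 rigid transform to an (N, 3) point array."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ T.T)[:, :3]

def stitch_sequence(frames):
    """frames: list of dicts with 'points' (N, 3 in ego frame), 'ego2world' (4x4),
    and 'boxes': list of (object_id, box) where box.contains(points) gives a mask
    and box.obj2ego is the 4x4 object-to-ego transform."""
    static_world, object_points = [], {}
    for frame in frames:
        pts = frame["points"]
        dynamic = np.zeros(len(pts), dtype=bool)
        # Cut out movable objects with the 3D box labels; accumulate each
        # object's points in its own canonical (box-centred) frame.
        for obj_id, box in frame["boxes"]:
            mask = box.contains(pts)
            dynamic |= mask
            object_points.setdefault(obj_id, []).append(
                apply_transform(pts[mask], np.linalg.inv(box.obj2ego)))
        # Remaining points are static scene: transform them to world coordinates.
        static_world.append(apply_transform(pts[~dynamic], frame["ego2world"]))
    static = np.concatenate(static_world)
    objects = {k: np.concatenate(v) for k, v in object_points.items()}
    return static, objects
```

For a given frame, the accumulated object points are then placed back according to that frame's box poses and merged with the static scene (transformed into the frame's ego coordinates) before densification and voxelization.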

Poisson reconstruction densification

First, normal vectors are estimated from the spatial distribution of each point's local neighborhood. The point cloud is then reconstructed into a triangular mesh with the Poisson surface reconstruction algorithm, which fills the holes in the point cloud and yields uniformly distributed vertices. Finally, the mesh is converted into dense voxels. This procedure increases the point cloud density and fills gaps in the point cloud.
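
The densification could look roughly like the following Open3D-based sketch; the library choice, the octree depth, the number of resampled points, and the voxel size are assumptions for illustration rather than values from the paper:

```python
import numpy as np
import open3d as o3d

# 'points' would be the stitched multi-frame LiDAR point cloud; placeholder data here.
points = np.random.rand(10000, 3) * 50.0
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))

# Estimate normals from each point's local neighbourhood.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))

# Poisson surface reconstruction turns the points into a triangle mesh and fills holes.
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)

# Resample the mesh to uniformly distributed points, then voxelize.
dense_pcd = mesh.sample_points_uniformly(number_of_points=200000)
voxel_grid = o3d.geometry.VoxelGrid.create_from_point_cloud(dense_pcd, voxel_size=0.5)
```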

Semantic annotation using NN algorithm

Figure 5. Comparison of different occupancy tags

This paper proposes to utilize a NN algorithm to assign a semantic label to each voxel in order to convert dense point clouds into dense voxels. First, the point cloud with semantic information is voxelized to obtain sparse occupancy labels; then the NN algorithm searches for the sparse voxel nearest to each dense voxel and assigns its semantic label to that voxel. This process can be parallelized on the GPU to increase speed. The resulting dense voxels provide more realistic occupancy labels and clear semantic boundaries.
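
A minimal CPU sketch of this nearest-neighbour assignment with a KD-tree (the paper parallelizes the search on the GPU, which is not shown; all data below are placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

# Sparse, semantically labelled voxel centres (from voxelizing the labelled points)
# and dense voxel centres obtained after Poisson densification.
sparse_centers = np.random.rand(5000, 3) * 100.0
sparse_labels = np.random.randint(0, 17, size=5000)
dense_centers = np.random.rand(200000, 3) * 100.0

# For every dense voxel, copy the label of the nearest labelled sparse voxel.
tree = cKDTree(sparse_centers)
_, nn_idx = tree.query(dense_centers, k=1)
dense_labels = sparse_labels[nn_idx]
```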

Experiments

Compared with other SOTA methods on the two mainstream datasets, nuScenes and SemanticKITTI, the method in this paper achieves the best accuracy.

By gradually adding the different components, the model performance is further improved.

In rainy or nighttime scenes, although the RGB image quality degrades, the method can still predict fine-grained occupancy.

Summary

This paper proposes SurroundOcc, a multi-camera 3D occupancy prediction method. The method utilizes 2D-3D spatial attention to integrate 2D features into 3D volumes in a multi-scale manner, which are then upsampled and fused by 3D deconvolution layers. In addition, this paper designs a pipeline to generate dense occupancy labels by stitching multi-frame LiDAR point clouds of dynamic objects and the static scene, and using the Poisson reconstruction algorithm to fill holes. The advantages of the method are fully verified on the nuScenes and SemanticKITTI datasets.

Final words

If you are also interested in the full-stack fields of artificial intelligence and computer vision, we strongly recommend following the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, multi-domain, in-depth interpretations of cutting-edge papers and mature industrial solutions every day! Feel free to add the editor's WeChat account, cv_huber, and let's discuss more interesting topics together!
