CVPR 2023 Occupancy Prediction Challenge: Top Solutions Summary

An analysis and summary of the top five open-source solutions in the CVPR 2023 Occupancy Prediction challenge.

This was my first time taking part in a CVPR competition, so I don't have much contest experience. This article summarizes the open-source solutions from the top teams and adds some thoughts of my own. Discussion in the comments is welcome.

Occupancy Prediction task description

3D Occupancy Prediction (Occ) is a perception task that Tesla introduced at its 2022 AI Day. It was proposed because the 3D bounding boxes produced by conventional 3D object detection are not expressive enough for general objects (objects outside the dataset's label set). In this task, the scene is instead divided into voxels, and the network must predict the category of every voxel in the 3D voxel space; it can be viewed as semantic segmentation extended to 3D voxel space. An example prediction is shown in the figure below.

Source: https://github.com/NVlabs/FB-BEV

In this competition, the Occ dataset is built on top of the nuScenes dataset, and contestants are required to predict the occupancy of a 200x200x16 3D voxel space using only the image modality. The evaluation metric is mIoU, and predictions are evaluated only within the visible range of the cameras:
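Spelled out, this is the standard mean intersection-over-union (the textbook definition, not copied from the challenge page), with true positives, false positives, and false negatives counted per class only inside the camera-visible mask:

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}$$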

For the detailed rules, see the official GitHub repository:

https://github.com/CVPR2023-3D-Occupancy-Prediction/CVPR2023-3D-Occupancy-Prediction

Introduction to the Baselines

In the competition, there are two baselines to choose from: an official implementation based on the BEVFormer framework, and one based on the BEVDet framework. They also represent the two mainstream implementation routes in 3D object detection: LSS and Transformer.

Both baselines lift the features that feed the original detection head from BEV space to a 200x200x16 3D voxel space, then attach a simple semantic segmentation head to predict the 3D occupancy result. Their performance is shown in the table below:

Method                         mIoU
BEVFormer-R101                 23.67
BEVDet-R50-256x704             36.1
BEVDet-R50-384x704             37.3
BEVDet-R50-Longterm-384x704    39.3
BEVDet-STBase-512x1024         42.0

From the official and community baselines, several obvious ways to gain score can be identified:

  • Higher input resolution

  • Using more temporal (multi-frame) information

  • Using a more advanced backbone to extract features

The reason the official BEVFormer baseline differs so much from BEVDet is that it does not use the camera mask during training; that is, it also computes the loss outside the cameras' visible range, whereas BEVDet does not compute loss there during training. This effect can be seen in the fifth-place team's ablation experiments.

Next, the top five open-source methods are introduced. If anything is wrong, please correct me in the comments!

FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

The first-place team is from NVIDIA. Their overall framework is built not on LSS or BEVFormer but on FB-BEV, a framework developed by NVIDIA itself. At the time of writing, its code had not yet been open-sourced, and there is no detailed paper introducing FB-BEV. Below, I introduce the network structure design, the scaling-up of the model, and the use of pre-training and post-processing.

The team's technical report for this competition is available here:

https://opendrivelab.com/e2ead/AD23Challenge/Track_3_NVOCC.pdf

Network structure design

The following is my preliminary understanding of the FB-BEV and FB-OCC networks; discussion in the comments is welcome.

For generating the BEV space, LSS can be regarded as a forward process: during the forward pass, a rough BEV feature is generated through depth estimation. BEVFormer instead generates BEV features from BEV queries; in the forward direction, these queries are unknown, artificially initialized values, and the actual BEV features are learned by the model through backpropagation.

(a) LSS lifts image features to generate the BEV space; (b) BEVFormer uses BEV queries to generate the BEV space

FB-BEV combines the strengths and weaknesses of the two, and designs a BEV feature generation method with both forward and backward components. Because the depth estimation in the LSS approach is discrete, the 3D features it generates are relatively sparse; the BEV queries in BEVFormer do not have this problem, but randomly initialized BEV queries are hard to optimize well. FB-BEV therefore uses the 3D voxel features generated by the LSS branch as the initial values of the BEV queries, so they can be optimized more easily, and finally fuses the two sets of features so that the network obtains a better description of 3D space. The specific structure is shown in the figure below.

FB-OCC structure diagram: F-VTM denotes forward BEV generation (similar to LSS); B-VTM denotes backward BEV generation (similar to BEVFormer)
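To make the forward-backward idea concrete, here is a minimal toy sketch of how LSS-style voxel features could seed the backward queries. This is my own illustration, not NVIDIA's FB-BEV code: the module names and shapes are assumptions, and plain cross-attention stands in for the real (likely deformable) attention.

```python
import torch
import torch.nn as nn

class ForwardBackwardVTM(nn.Module):
    """Toy forward-backward view transformation (assumed structure)."""
    def __init__(self, c=64, num_heads=4):
        super().__init__()
        # Backward branch: voxel queries attend to image features.
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * c, c)

    def forward(self, img_feats, fwd_voxels):
        # img_feats:  (B, N_pix, C) flattened multi-view image features
        # fwd_voxels: (B, N_vox, C) coarse voxel features from the LSS branch
        # Key idea: initialize the voxel queries with the forward (LSS)
        # output rather than random embeddings, so optimization starts
        # from a geometrically meaningful state.
        refined, _ = self.attn(query=fwd_voxels, key=img_feats, value=img_feats)
        # Fuse the forward and backward features.
        return self.fuse(torch.cat([fwd_voxels, refined], dim=-1))

vtm = ForwardBackwardVTM()
out = vtm(torch.randn(2, 1024, 64), torch.randn(2, 500, 64))
print(out.shape)  # torch.Size([2, 500, 64])
```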

Finally, the authors design the occupancy head with multi-scale prediction fusion. The specific structure is as follows:

Occupancy Head structure diagram

Model scaling and pre-training

Using a bigger and better model is undoubtedly the simplest, most brute-force way to raise network performance, but an overly large model may overfit on nuScenes. InternImage-H is used as the backbone, and to apply InternImage-H more effectively, the authors further pre-train it on Objects365 on top of the original COCO pre-training, so it transfers better to this task.

However, applying a pre-trained model this way only helps at the backbone level and does not greatly improve the network's perception ability, so the authors further pre-train the network on nuScenes for depth estimation. Training on the pure depth estimation task alone, though, would bias the pre-trained weights toward that task, which is also unfavorable for perception; the network is therefore pre-trained on the joint task of depth estimation and 2D semantic segmentation, improving the perceptual ability of the pre-trained model.

However, the nuScenes dataset does not provide 2D semantic segmentation labels for its images, so the authors use the recently popular SAM for automatic labeling, using the detection boxes and point-cloud segmentation labels in nuScenes to generate more accurate masks. The training scheme is shown in the figure below:

 

Joint pre-training diagram for depth estimation and 2D semantic segmentation

(Personally, I think that if this is too much trouble, joint training with depth estimation plus 3D or 2D object detection could achieve a similar effect; but since 2D semantic segmentation is actually very close to the 3D occupancy task, its effect is probably better.)

Post-processing

In post-processing, beyond the TTA strategies familiar from past competitions, the authors also add a temporal TTA operation. My personal guess is that this runs the forward pass with different numbers of temporal frames and aggregates the results as TTA.

In addition, the authors observed that within the same scene, the network recognizes distant regions worse than nearby ones, so the predictions for nearby static objects can be cached: when the ego vehicle drives away and what was originally nearby becomes distant, the predictions for those static objects can be replaced with the earlier nearby results.

In the end, their fused result reached 54.19% on the test set, far ahead of everyone else in this competition. Personally, I think their thinking about pre-trained models is more worth studying than the network's structural design.

MiLO: Multi-task Learning with Localization Ambiguity Suppression for Occupancy Prediction

The second-place team is from the autonomous driving company 42dot, and their network framework is developed on top of BEVDet as a whole. Since they do not change the network structure much, I mainly introduce two aspects: multi-task training and the post-processing design.

The team's technical report for this competition is available here:

https://opendrivelab.com/e2ead/AD23Challenge/Track_3_42dot.pdf

Multi-task training

The idea behind the multi-task training here is actually the same as the first-place team's purpose in using pre-training: to optimize the network better. Because the overall BEVDet framework transforms features across multiple views (2D, 2D-to-3D, 3D), it is hard to optimize the lower layers well if supervision is applied only to the final output features; this problem is also mentioned in PANet. The authors therefore introduce a new branch in the 2D FPN part that performs 2D semantic segmentation to supervise the optimization of the network's 2D part, while the 2D-to-3D and 3D parts use the same depth supervision and 3D occupancy supervision as BEVDet.

Two ResNet layers are attached after the FPN for 2D semantic segmentation; this feature is also used to update the FPN features, which are then fed into the 2D-to-3D network

Here the authors did not use the first-place team's more complex method of generating semantic segmentation labels for 2D images; instead, they directly project the point-cloud segmentation labels into image space and compute the loss only at those projected points.
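As a concrete sketch of this labeling step (my reconstruction under assumptions, not 42dot's code; the `lidar2img` matrix and all shapes are assumed):

```python
import numpy as np

def project_lidar_labels(points, labels, lidar2img, h, w):
    """points: (N, 3) lidar xyz; labels: (N,) class ids;
    lidar2img: (4, 4) projection matrix; returns pixel coords and labels."""
    pts = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    cam = pts @ lidar2img.T                      # project into image space
    in_front = cam[:, 2] > 1e-3                  # keep points in front of camera
    uv = cam[in_front, :2] / cam[in_front, 2:3]  # perspective divide
    lab = labels[in_front]
    # Keep only pixels that fall inside the image.
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[valid].astype(np.int64), lab[valid]

# During training, the 2D segmentation loss would be computed only at these
# sparse pixels, e.g. cross_entropy(pred[:, v, u], lab).
```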

Post-processing

The authors noticed that localization from images is relatively ambiguous. As shown in the figure below, pedestrians are indeed predicted in roughly the right area, but their exact predicted locations are blurred.

The authors then analyzed their own predictions by setting 6 distance bins and measuring mIoU in each, and found that for some categories, such as bicycle and motorcycle, the mIoU in the farthest bin is almost 0. Based on inference observations and tests on the validation set, they set a different threshold for each of the 6 distance bins; if a predicted score falls below its bin's threshold, the voxel's category is set to free, which alleviates the problem.
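A small sketch of what this distance-binned thresholding could look like; the bin edges, threshold values, and the free-class id below are placeholders of mine, not the paper's numbers:

```python
import numpy as np

FREE_ID = 17  # assumed id of the "free" class
bin_edges = np.array([10.0, 20.0, 30.0, 40.0, 50.0])         # meters (assumed)
thresholds = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30])  # one per bin

def suppress_low_confidence(voxel_xyz, scores, labels):
    """voxel_xyz: (N, 3) voxel centers in the ego frame; scores: (N,) max
    class score; labels: (N,) predicted class ids. Returns adjusted labels."""
    dist = np.linalg.norm(voxel_xyz[:, :2], axis=1)  # BEV distance to ego
    bins = np.digitize(dist, bin_edges)              # map each voxel to a bin
    labels = labels.copy()
    labels[scores < thresholds[bins]] = FREE_ID      # low score -> free
    return labels
```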

In the end, their result reached 52.45% on the test set. Both the first- and second-place solutions show that introducing 2D tasks, whether as 2D semantic segmentation pre-training or as multi-task training, helps optimize the network for this task.

UniOcc: Unifying Vision-Centric 3D Occupancy Prediction with Geometric and Semantic Rendering

The third-place team is from Xiaomi, and achieved 51.27% with a single model and without using additional labeled data; by comparison, you could say everyone else was just there for fun. Their method does not stick to the LSS or BEVFormer framework form, but proposes some plug-and-play components for this task that improve network performance.

The team's technical report for this competition is available here:

https://opendrivelab.com/e2ead/AD23Challenge/Track_3_UniOcc.pdf

They treat the 3D occupancy task as a rendering problem and try to solve it with ideas from NeRF, finally training the model with a Teacher-Student scheme. Here, I focus on how the 3D occupancy task is converted into a NeRF-style rendering problem.

NeRF

NeRF processing

Due to limited computing resources, NeRF cannot sample too many points along one ray; moreover, points in some places provide little information or lie in empty space, which does not help the network model the scene. A better sampling strategy is to place more points on actual surfaces, so NeRF proposes hierarchical sampling, which can be viewed as coarse and fine sampling carried out together: it first samples randomly along the ray to obtain a coarse estimate, then resamples around regions of higher density according to that estimate to obtain a fine estimate; both the coarse and fine networks are optimized simultaneously. The fine color value is computed as follows:
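The formula was lost in extraction; the standard NeRF volume-rendering equation for the color, which the text appears to refer to, is:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} T_k \left(1 - e^{-\sigma_k \delta_k}\right) \mathbf{c}_k, \qquad T_k = \exp\Bigl(-\sum_{j<k} \sigma_j \delta_j\Bigr)$$

where $\sigma_k$ and $\mathbf{c}_k$ are the density and color at the $k$-th sample and $\delta_k = z_{k+1} - z_k$ is the spacing between adjacent samples.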

From Occupancy to NeRF 

Overall block diagram of UniOcc 

Here $\beta_k = z_{k+1} - z_k$, i.e., the distance between two adjacent sampling points along the ray. The predicted 3D occupancy can then be obtained through the rendering of geometric and semantic information. (My understanding of NeRF is limited, so there may be problems with the explanation here.)
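For reference, a plausible form of the rendering step (my reconstruction from the $\beta_k$ definition above, not necessarily the paper's exact notation): with density $\sigma_k$ and semantic logits $\mathbf{s}_k$ predicted at samples along a ray,

$$\alpha_k = 1 - e^{-\sigma_k \beta_k}, \qquad \hat{S}(\mathbf{r}) = \sum_k \Bigl(\prod_{j<k} (1 - \alpha_j)\Bigr) \alpha_k\, \mathbf{s}_k, \qquad \hat{D}(\mathbf{r}) = \sum_k \Bigl(\prod_{j<k} (1 - \alpha_j)\Bigr) \alpha_k\, z_k$$

so the rendered semantics $\hat{S}$ and depth $\hat{D}$ can be supervised with 2D labels and lidar, respectively.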

Multi-Scale Occ: 4th Place Solution for CVPR 2023 3D Occupancy Prediction Challenge

The fourth-place team comes from SAIC AI LAB. Their overall framework follows the design of BEVDet; the main proposals are using multi-scale information for training and prediction, and a decoupled-head prediction scheme.

The team's technical report for this competition is available here:

https://opendrivelab.com/e2ead/AD23Challenge/Track_3_occ-heiheihei.pdf

Multi-scale network structure design

The multi-scale design of the network follows SurroundOcc: an FPN builds multi-scale features, and features at different scales are then used to generate 3D voxel features of different sizes; here three voxel-space resolutions are used, 50x50x4, 100x100x8, and 200x200x16. Finally, a 3D-UNet-style structure fuses the features across scales to predict 3D occupancy, and losses are computed at every scale during training.
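A toy sketch of coarse-to-fine voxel fusion with deep supervision at every scale; this is my own illustration with assumed channel counts and layer choices, not SAIC's code:

```python
import torch
import torch.nn as nn

class MultiScaleVoxelDecoder(nn.Module):
    def __init__(self, c=32, num_classes=18):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.blocks = nn.ModuleList(
            [nn.Conv3d(c, c, 3, padding=1) for _ in range(2)]
        )
        # One occupancy classifier per scale, so every scale gets a loss.
        self.heads = nn.ModuleList([nn.Conv3d(c, num_classes, 1) for _ in range(3)])

    def forward(self, vox):
        # vox: voxel features from coarse to fine, e.g.
        # [(B,C,4,50,50), (B,C,8,100,100), (B,C,16,200,200)]
        feats, x = [vox[0]], vox[0]
        for i in range(2):
            # Upsample the coarser feature and add the next finer scale.
            x = self.blocks[i](self.up(x)) + vox[i + 1]
            feats.append(x)
        return [h(f) for h, f in zip(self.heads, feats)]  # logits per scale

dec = MultiScaleVoxelDecoder()
vox = [torch.randn(1, 32, 4, 50, 50),
       torch.randn(1, 32, 8, 100, 100),
       torch.randn(1, 32, 16, 200, 200)]
print([tuple(l.shape[2:]) for l in dec(vox)])  # [(4,50,50), (8,100,100), (16,200,200)]
```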

Decoupled head design

This task has an obvious class-imbalance problem: the free category occupies 96% of the training set. To address this, the authors decouple the prediction head into two parts: a binary classification head that predicts whether a voxel is free, and a semantic segmentation head over the other 16 categories. The binary head uses BCE as its loss function, while the semantic segmentation head uses focal loss. The final total loss is as follows:
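The formula itself did not survive extraction; given the description, a plausible form (the weights $\lambda$ are my assumption) is:

$$L = \lambda_{\mathrm{bin}}\, L_{\mathrm{BCE}}(\hat{o}, o) + \lambda_{\mathrm{sem}}\, L_{\mathrm{focal}}(\hat{s}, s)$$

where $\hat{o}$ is the binary free/occupied prediction and $\hat{s}$ the 16-way semantic prediction on occupied voxels.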

OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction

The fifth-place team is from Harbin Institute of Technology. The overall network structure follows the design of BEVFormer; the main changes are an improved 3D UNet in the head, additional data augmentation, and post-processing combined with 3D object detection.

The team's technical report for this competition is available here:

https://opendrivelab.com/e2ead/AD23Challenge/Track_3_occ_transformer.pdf

Data augmentation

To let the model exploit more local features, the authors apply cutout data augmentation to the input images.

(I also tried additional data augmentation early in the competition, but our framework borrows from BEVDet as a whole, and the BEVDet paper mentions that doing augmentation only in image space, without the corresponding BEV-space augmentation, actually reduces network performance; so those attempts failed. Cutout-style augmentation here may only be applicable to the BEVFormer framework. A minimal cutout sketch follows.)
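For reference, here is a minimal cutout implementation; the hole count and size are my placeholder values, not the paper's settings:

```python
import torch

def cutout(img, num_holes=4, size=32):
    """img: (C, H, W) float tensor; returns a copy with square holes zeroed,
    forcing the model to rely more on local context elsewhere."""
    img = img.clone()
    _, h, w = img.shape
    for _ in range(num_holes):
        y = torch.randint(0, h - size + 1, (1,)).item()
        x = torch.randint(0, w - size + 1, (1,)).item()
        img[:, y:y + size, x:x + size] = 0.0
    return img

aug = cutout(torch.rand(3, 256, 704))
```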

Model fusion post-processing

The authors found that 3D object detection performs better on dynamic objects than the 3D occupancy task does. They therefore use StreamPETR to generate 3D detection boxes and convert the boxes into 3D occupancy form: specifically, they set a different threshold per category, select high-confidence boxes, generate a point cloud with a spacing of t inside each box, voxelize the point cloud, and then label the voxels with the predicted class. Finally, this result is fused with the Occ network's predictions.
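A rough sketch of this box-to-occupancy conversion (my assumption of the procedure, not the team's code; real boxes would also carry a yaw angle, which an axis-aligned version ignores):

```python
import numpy as np

def box_to_voxels(center, size, t, voxel_size, grid_origin, grid_shape):
    """Fill an axis-aligned box (center, size) with points spaced t meters
    apart, then convert them to integer voxel indices."""
    lo, hi = center - size / 2, center + size / 2
    axes = [np.arange(lo[i], hi[i] + 1e-6, t) for i in range(3)]
    pts = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, 3)
    idx = ((pts - grid_origin) / voxel_size).astype(np.int64)  # voxelize
    ok = ((idx >= 0) & (idx < np.array(grid_shape))).all(axis=1)
    return np.unique(idx[ok], axis=0)  # occupied voxel indices

# e.g. a 200x200x16 grid covering [-40, 40] m with 0.4 m voxels:
vox = box_to_voxels(np.array([5.0, 2.0, 0.5]), np.array([4.0, 2.0, 1.5]),
                    t=0.2, voxel_size=0.4,
                    grid_origin=np.array([-40.0, -40.0, -1.0]),
                    grid_shape=(200, 200, 16))
```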

Summary

The final top-10 leaderboard is as follows:

 
