【BEV】TPVFormer reproduction and principles

1 Introduction

In networks that process surround-view images, the bird's eye view (BEV) is often used for feature extraction. Although it is more efficient than a voxel representation, it also loses some information. To address this, the TPVFormer paper proposes three views to represent three-dimensional features, and its experiments verify that using only images as input can achieve segmentation results comparable to LiDAR-based methods.

This article mainly introduces how to run the mini dataset locally and generate the corresponding video; the source code will be studied further in a later post.

mini dataset: https://pan.baidu.com/s/1oKvicVacbPFZNtXO7l9t7A?pwd=p4h4 Extraction code: p4h4

Result visualization: https://www.bilibili.com/video/BV1oX4y1o7FQ/?spm_id_from=333.999.0.0
BEV discussion group - WeChat: Rex1586662742, QQ group: 468713665.

2. Run

In the TPVFormer repository, the author only provides nuscenes_infos_train.pkl and nuscenes_infos_val.pkl for the full nuScenes dataset. For most learners it is impractical to test on the full nuScenes dataset. After being asked, the original author also provided the pkl files for the mini dataset, which can be obtained, together with the lidar files, from the mini dataset link above.

2.1 Run eval.py

After organizing the dataset, run the following command to evaluate:

python eval.py --py-config xxxx --ckpt-path xxxx

Running it directly will likely report an error. If the error is that there is no "lidarseg" in self.table_names, you need to modify the nuscenes-devkit installed in your environment, for example
/home/snk/anaconda3/envs/tpv/lib/python3.8/site-packages/nuscenes_devkit-1.1.10-py3.8.egg/nuscenes/nuscenes.py
- Add 'lidarseg' to self.table_names in that nuscenes/nuscenes.py file

self.table_names = ['category', 'attribute', 'visibility', 'instance', 'sensor', 'calibrated_sensor',
                            'ego_pose', 'log', 'scene', 'sample', 'sample_data', 'sample_annotation', 'map','lidarseg']

Also add the following line nearby, next to the existing __load_table__ calls:

self.lidarseg = self.__load_table__('lidarseg')
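For orientation, this is roughly what the modified part of NuScenes.__init__ ends up looking like; the surrounding lines are paraphrased and may differ between devkit versions, and only the two marked lines are the actual edits described above:

# inside NuScenes.__init__ (nuscenes/nuscenes.py); context paraphrased, may differ by devkit version
self.table_names = ['category', 'attribute', 'visibility', 'instance', 'sensor', 'calibrated_sensor',
                    'ego_pose', 'log', 'scene', 'sample', 'sample_data', 'sample_annotation', 'map',
                    'lidarseg']                      # <- 'lidarseg' appended
# ... existing self.xxx = self.__load_table__('xxx') calls for the other tables ...
self.lidarseg = self.__load_table__('lidarseg')      # <- new line that loads the lidarseg table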

Run the command again:

python eval.py --py-config xxxx --ckpt-path xxxx

2.2 vis_scence.py

Installing the environment according to the instructions in the project may cause problems; if so, the visualization dependencies can be installed as follows:

pip install vtk==9.0.1
pip install mayavi==4.7.3
sudo apt update
sudo apt install xvfb
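xvfb provides a virtual display, which matters when rendering on a machine without a screen. As a usage note (my own suggestion, not from the project README), the visualization script can be launched under a virtual display like this:

xvfb-run -a python visualization/vis_scence ...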

After the installation is complete, you can run the commands below to render the scenes and generate the video; my generated video is at the visualization link in the introduction.

python visualization/vis_scence ... 
python visualization/generate_videos.py

If an error related to PyQt5 is reported, uninstall PyQt5.
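For reference, it can be removed with pip:

pip uninstall PyQt5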

3 Introduction to the paper

3.1 Principles

Usually only the top (bird's-eye) view is used to compute three-dimensional features. This paper proposes a different way of characterizing 3D features, the tri-perspective view (TPV) representation. With features from three orthogonal planes, purely vision-based 3D semantic occupancy prediction and 3D semantic segmentation become straightforward, and the author benchmarks TPVFormer against Tesla's occupancy network. The main pipeline of TPVFormer is shown in the figure below:
[Figure: overall TPVFormer pipeline]

The input is six surround-view images. An image backbone extracts feature maps at different scales (multi-scale features are widely used at the moment). The TPVFormer module then produces the TPV features, and finally the three planes are aggregated into a [100, 100, 8] voxel grid, where each voxel feature is obtained by adding the features from the three directions. During training, real LiDAR points are used for supervision; at inference time, dense voxel features can be output.

img_feats = self.extract_img_feat(img=img, use_grid_mask=use_grid_mask)  # extract multi-scale image features
outs = self.tpv_head(img_feats, img_metas)  # [1, 10000, 256], [1, 800, 256], [1, 800, 256]: plane features for the three directions
outs = self.tpv_aggregator(outs, points)    # segmentation result
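To make "each voxel feature is the sum of the three directions" concrete, here is a minimal sketch of the aggregation idea (my own illustration with made-up tensor names, not the repository's tpv_aggregator code), broadcasting three plane features into a dense [100, 100, 8] grid:

import torch

C, H, W, D = 256, 100, 100, 8  # channels and grid size used in this post

# plane features, reshaped from the [1, 10000, 256], [1, 800, 256], [1, 800, 256] outputs above
tpv_hw = torch.randn(1, C, H, W)  # top plane (H x W)
tpv_zh = torch.randn(1, C, D, H)  # side plane (D x H)
tpv_wz = torch.randn(1, C, W, D)  # front plane (W x D)

# broadcast each plane along its missing axis, then add: one feature per voxel
voxel_feat = (tpv_hw[:, :, :, :, None]                         # [1, C, H, W, 1]
              + tpv_zh.permute(0, 1, 3, 2)[:, :, :, None, :]   # [1, C, H, 1, D]
              + tpv_wz[:, :, None, :, :])                      # [1, C, 1, W, D]
print(voxel_feat.shape)  # torch.Size([1, 256, 100, 100, 8])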

Why propose feature maps in three directions? The paper uses the following figure to illustrate:
[Figure: comparison of voxel, BEV, and TPV representations]
Using voxels directly to represent three-dimensional features greatly increases the amount of computation, while using BEV features alone loses height information. TPV is a compromise between the two: it greatly reduces the amount of computation while retaining features from the different views. How to obtain the TPV features is therefore the focus of this paper.
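A quick back-of-the-envelope count with the grid size used above shows the saving:

H, W, D = 100, 100, 8

voxel = H * W * D              # dense voxel grid: 80,000 query features
bev = H * W                    # single top-down plane: 10,000 features, but no height axis
tpv = H * W + D * H + W * D    # three TPV planes: 10,000 + 800 + 800 = 11,600 features
print(voxel, bev, tpv)         # 80000 10000 11600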
[Figure: detailed TPVFormer architecture]

The picture above is a more detailed version of the first figure; focus mainly on the second half. TPVFormer can be divided into cross-attention and hybrid-attention: in cross-attention the TPV queries attend to the multi-scale image features, while in hybrid-attention the three TPV planes attend to each other. Both use deformable attention to reduce the amount of computation. After passing through TPVFormer, the TPV features are obtained. From the TPV features, the feature of any voxel in 3D space can be recovered and then classified by the segmentation head, achieving the occupancy effect.
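As a minimal sketch of how "the feature of any point in 3D space" can be read out of the TPV planes (my own illustration of the idea, with assumed function names and axis conventions, not the paper's exact sampling code): project the point onto each of the three planes, bilinearly sample each plane, and sum the three samples before passing the result to the segmentation head.

import torch
import torch.nn.functional as F

def query_point_feature(tpv_hw, tpv_zh, tpv_wz, xyz):
    # tpv_*: [1, C, A, B] plane features; xyz: [N, 3] point coordinates normalized to [-1, 1]
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]

    def sample(plane, u, v):
        # grid_sample expects a sampling grid of shape [1, N, 1, 2]
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        out = F.grid_sample(plane, grid, align_corners=False)  # [1, C, N, 1]
        return out[0, :, :, 0].t()                             # [N, C]

    # drop one coordinate per plane, sample each plane bilinearly, and sum
    feat = sample(tpv_hw, x, y) + sample(tpv_zh, y, z) + sample(tpv_wz, z, x)
    return feat  # [N, C], ready for a per-point segmentation head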

3.2 Results

[Figure: quantitative results from the paper]

4. Summary

This article introduced how to run TPVFormer locally and walked through the main figures in the paper, focusing on how the TPV features are extracted and how they are used. The project's code is quite readable and well worth studying.


Origin blog.csdn.net/weixin_42108183/article/details/129629303