Paper Speed Read – BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Like before you read; it's a good habit. If this post helps, give me a follow! I'll keep updating. Thanks for your support!

References:
1. BEVerse
2. Analysis of 3D Vision Workshop
3. The mAP evaluation metric
4. nuScenes dataset evaluation metrics

1. Summary

This paper proposes BEVerse, a unified architecture for vision-centric multi-task perception and prediction. It performs shared feature extraction on multi-timestamp, multi-view images and lifts the features to generate 4D BEV representations. After ego-motion alignment, a spatio-temporal encoder extracts further BEV features. Finally, a multi-task decoder performs joint perception and prediction. Within the decoder, a grid sampler is proposed to generate BEV features with task-specific ranges and granularities, and an iterative-flow method is designed for memory-efficient future prediction. Experiments show that temporal information improves 3D object detection and semantic map construction, and that multi-task learning also benefits motion prediction.


2. Introduction

Main contributions:

  • A framework, BEVerse, for multi-camera BEV representation that unifies the perception and prediction tasks.
  • An iterative-flow method for efficient future prediction and multi-task learning.
  • As a multi-task model, BEVerse reaches state-of-the-art performance on 3D object detection, semantic map construction, and motion prediction.

Related work:
3D object detection: FCOS3D, PGD, DETR3D, PETR, BEVDet
Semantic map construction: HDMapNet (online construction), BEVSegFormer
Motion prediction: mostly unsupervised learning methods; FIERY (the first BEV motion prediction framework), StretchBEV
Multi-task learning: the focus is on how to design a shared structure and how to balance and optimize multiple tasks. FAFNet, MotionNet

3. Network and Method

BEVerse takes M surround-view camera images from N timestamps, together with vehicle ego-motion and camera parameters, as input. Its outputs are the 3D bounding boxes and semantic map for the current frame, plus motion prediction for the surrounding obstacles. BEVerse consists of four submodules: an image-view encoder, a view transformer, a spatio-temporal BEV encoder, and a multi-task decoder.
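To make the data flow concrete, here is a minimal PyTorch-style sketch of the overall forward pass. All module handles, names, and shapes are illustrative assumptions, not the authors' code:

```python
import torch

def beverse_forward(images, ego_motion, cam_params,
                    image_encoder, view_transformer,
                    temporal_encoder, task_decoders):
    """Sketch of the BEVerse pipeline: encode -> lift -> align -> decode.
    images: (N, M, 3, H, W) for N timestamps and M cameras."""
    N, M = images.shape[:2]
    # 1) Shared image-view encoding for every (timestamp, camera) image.
    feats = image_encoder(images.flatten(0, 1))                     # (N*M, C, h, w)
    # 2) Lift each frame's multi-view features into a BEV grid (LSS-style).
    bev = view_transformer(feats.unflatten(0, (N, M)), cam_params)  # (N, C, X, Y)
    # 3) Warp past BEV frames into the current frame using ego-motion,
    #    then fuse them with the spatio-temporal encoder.
    fused = temporal_encoder(bev, ego_motion)                       # (C', X, Y)
    # 4) Parallel, independent task decoders: detection, map, motion.
    return {name: decoder(fused) for name, decoder in task_decoders.items()}
```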

3.1 Image-View Encoder

Swin Transformer is used as the backbone to extract multi-level features C2, C3, C4, and C5, where each level halves the width and height of the previous one. As in BEVDet, C5 is upsampled and concatenated with C4.
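A minimal sketch of this fusion step; the channel counts and the fusion convolution are assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Sketch of the BEVDet-style neck: upsample C5 and fuse it with C4.
    Channel sizes below are illustrative (Swin-T-like) assumptions."""
    def __init__(self, c4_channels=384, c5_channels=768, out_channels=512):
        super().__init__()
        self.reduce = nn.Conv2d(c4_channels + c5_channels, out_channels, 3, padding=1)

    def forward(self, c4, c5):
        # Upsample C5 (stride 32) to C4's resolution (stride 16), then fuse.
        c5_up = F.interpolate(c5, size=c4.shape[-2:], mode='bilinear',
                              align_corners=False)
        return self.reduce(torch.cat([c4, c5_up], dim=1))
```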

3.2 View Transformer

Since temporal 3D information must be learned, the view transformer converts the multi-view features F into the BEV feature G. Following the LSS (Lift-Splat-Shoot) method, a 1x1 convolution is applied to F to predict a categorical depth distribution F'; each pixel's feature is then distributed along its camera ray according to this distribution, and the resulting 3D feature points are splatted onto the BEV grid.
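A rough sketch of the LSS-style lifting step; the number of depth bins and context channels are assumptions:

```python
import torch
import torch.nn as nn

class LSSLift(nn.Module):
    """Sketch of LSS-style lifting: a 1x1 conv predicts, per pixel, a
    categorical depth distribution plus context features; their outer
    product places features along each camera ray. D and C are assumed."""
    def __init__(self, in_channels=512, depth_bins=59, context_channels=64):
        super().__init__()
        self.depth_bins = depth_bins
        self.proj = nn.Conv2d(in_channels, depth_bins + context_channels, 1)

    def forward(self, feat):                            # feat: (B, C_in, h, w)
        x = self.proj(feat)
        depth = x[:, :self.depth_bins].softmax(dim=1)   # (B, D, h, w)
        context = x[:, self.depth_bins:]                # (B, C, h, w)
        # Outer product: every depth bin gets a scaled copy of the context.
        frustum = depth.unsqueeze(1) * context.unsqueeze(2)  # (B, C, D, h, w)
        return frustum  # splatting onto the BEV grid follows (not shown)
```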

3.3 Spatio-Temporal BEV Encoder

First, the past frames are temporally aligned, i.e. warped into the current frame using ego-motion. Following the FIERY method, the BEV encoder consists of a stack of temporal blocks, built mainly from 3D convolutions, global pooling, and intermediate feature-compression layers.
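For intuition, a loose sketch of one such temporal block; the exact branch layout in FIERY-style blocks differs, so treat this as an assumption-laden illustration:

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Rough sketch of a FIERY-style temporal block: a local 3D-convolution
    branch plus a globally pooled context branch, fused by a channel-
    compression layer. Kernel shapes and branch layout are assumptions."""
    def __init__(self, channels=64):
        super().__init__()
        self.local = nn.Conv3d(channels, channels, kernel_size=(2, 3, 3),
                               padding=(0, 1, 1))  # mixes adjacent timestamps
        self.global_pool = nn.AdaptiveAvgPool3d(1)
        self.compress = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, T, X, Y)
        local = self.local(x)                  # (B, C, T-1, X, Y)
        g = self.global_pool(x).expand(-1, -1, local.shape[2],
                                       local.shape[3], local.shape[4])
        return self.compress(torch.cat([local, g], dim=1))
```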

3.4 Task Decoder

The multi-task decoder is composed of a set of parallel, independent decoders. Each task decoder includes a grid sampler, a task encoder, and a task head. The grid sampler crops the task-specific region of the BEV feature and resamples it to the desired resolution with bilinear interpolation, since different tasks require different spatial ranges and granularities. The task encoder follows BEVDet: a backbone built from ResNet basic blocks, combining multi-scale features in the same way as the image-view encoder.
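A small sketch of how such a grid sampler can be implemented with torch.nn.functional.grid_sample; the range conventions (square region, identical x/y extents) are my assumptions:

```python
import torch
import torch.nn.functional as F

def grid_sampler(bev, src_range, dst_range, dst_size):
    """Crop a task-specific sub-region of a BEV feature map and resample it
    to a target resolution with bilinear interpolation.
    src_range/dst_range: (min, max) in meters; dst_size: (H, W)."""
    B = bev.shape[0]
    xs = torch.linspace(dst_range[0], dst_range[1], dst_size[1], device=bev.device)
    ys = torch.linspace(dst_range[0], dst_range[1], dst_size[0], device=bev.device)
    # Normalize target metric coordinates into [-1, 1] w.r.t. the source range.
    norm = lambda v: 2 * (v - src_range[0]) / (src_range[1] - src_range[0]) - 1
    gy, gx = torch.meshgrid(norm(ys), norm(xs), indexing='ij')
    grid = torch.stack([gx, gy], dim=-1).expand(B, -1, -1, -1)  # (B, H, W, 2)
    return F.grid_sample(bev, grid, mode='bilinear', align_corners=True)
```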

3.5 Output Heads

3D object detection head. In the BEV space, the representation gap with LiDAR-based detectors disappears, so the first stage of CenterPoint is used directly as the 3D detection head.
Semantic map construction head.
Motion prediction head. Unlike the heads above, which only concern the current frame, motion prediction forecasts future states. The effectiveness of the FIERY prediction module is constrained by two factors: (1) every BEV pixel shares the same sampled global latent vector φt, which cannot represent the uncertainties of multiple individual agents; (2) the future states are initialized only from the sampled latent vector, which increases the difficulty of prediction. Unlike FIERY, BEVerse proposes an iterative-flow scheme that predicts directly from sampled latent maps, which separates the uncertainties of different targets.
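The paper's exact iterative-flow formulation is not reproduced here; below is only a loose sketch of the general idea under my own assumptions: a small head predicts a 2D flow field from the current BEV state plus a per-pixel latent map, the state is warped by that flow, and the step repeats once per future timestep.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeFlow(nn.Module):
    """Loose sketch of an iterative-flow predictor (details are assumptions):
    predict a flow field, warp the BEV state by it, repeat per future step."""
    def __init__(self, channels=64, latent_channels=32):
        super().__init__()
        self.flow_head = nn.Conv2d(channels + latent_channels, 2, 3, padding=1)

    def warp(self, state, flow):               # backward-warp by predicted flow
        B, _, H, W = state.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=state.device),
            torch.linspace(-1, 1, W, device=state.device), indexing='ij')
        base = torch.stack([xs, ys], dim=-1).expand(B, -1, -1, -1)
        # Convert pixel-space flow into normalized grid offsets.
        offset = torch.stack([flow[:, 0] * 2 / W, flow[:, 1] * 2 / H], dim=-1)
        return F.grid_sample(state, base + offset, align_corners=True)

    def forward(self, state, latent_map, n_future):
        futures = []
        for _ in range(n_future):
            flow = self.flow_head(torch.cat([state, latent_map], dim=1))
            state = self.warp(state, flow)
            futures.append(state)
        return torch.stack(futures, dim=1)      # (B, T_future, C, X, Y)
```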


4. Experiments and Results

Dataset: nuScenes, 1000 autonomous-driving video clips, each clip 20 s long

  • 700 --> training
  • 150 --> validation
  • 150 --> test

Evaluation criteria:
3D object detection: the nuScenes evaluation metrics
mAP: area under the precision-recall curve, averaged over classes
ATE, ASE, AVE, AOE, AAE (average translation, scale, velocity, orientation, and attribute errors)
Semantic map construction: mIoU; the main map elements are lane dividers, pedestrian crossings, and road boundaries
Motion prediction: IoU and VPQ (Future Video Panoptic Quality)
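For reference, VPQ as introduced in FIERY accumulates a panoptic-quality score over the prediction horizon H (formula reproduced from FIERY, not from this post):

$$\mathrm{VPQ} = \sum_{t=0}^{H} \frac{\sum_{(p_t, q_t) \in TP_t} \mathrm{IoU}(p_t, q_t)}{|TP_t| + \tfrac{1}{2}|FP_t| + \tfrac{1}{2}|FN_t|}$$

where $TP_t$, $FP_t$, and $FN_t$ are the true-positive, false-positive, and false-negative instance matches at future timestep $t$.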

Results: (result tables omitted)


Reprinted from: blog.csdn.net/weixin_36354875/article/details/126602249