ICCV 2023 | Faster and Stronger! BIT & Megvii propose StreamPETR: vision-only perception can finally go head-to-head with LiDAR!


Author: ChrisM | Reprinted with permission (source: Zhihu) | Editor: CVer

https://zhuanlan.zhihu.com/p/645470702

Although vision-only BEV perception has been developing for about two years, a large gap to LiDAR-based algorithms has always remained. BEVDepth narrowed the gap between pure vision and LiDAR to less than 10% for the first time, but a series of follow-up works seemed to hit a bottleneck, mostly hovering around 62% NDS (not counting the boost from future frames). Does vision-only perception really stop here?

To answer this question, here is our latest work, "Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection" (StreamPETR), which has just been accepted to ICCV 2023. It proposes using object queries as the carrier of temporal propagation, and becomes the first vision-only BEV perception algorithm (67.6% NDS, 65.3% AMOTA) to surpass CenterPoint (67.3% NDS, 63.8% AMOTA) without using future frames. At the same time, because propagating object queries over time is very cheap, StreamPETR introduces almost no extra computation compared with single-frame PETR, which also makes it the fastest BEV perception algorithm to date, 2.8 times faster than SOLOFusion!


Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

Article link: arxiv.org/abs/2303.11926

Code: https://github.com/exiawsh/StreamPETR

Motivation

[Fig. 1]

After BEVFormer, the community shifted its focus in BEV perception to temporal modeling, and SOLOFusion demonstrated the remarkable gains that long-term temporal modeling can bring. So is there still room for improvement in existing temporal modeling? To answer this, we divide the existing methods into two categories: BEV temporal modeling and perspective temporal modeling.


Method


The StreamPETR model consists of three main parts: an image encoder, a memory queue, and a propagation transformer. Here we focus on the memory queue and the propagation transformer, which are responsible for temporal fusion:

Memory Queue: the queue has size N×K, where N is the number of stored frames and K is the number of objects stored per frame. It is updated first-in-first-out (FIFO), and for each frame the top-K foreground objects are selected for storage according to the classification score. Specifically, each memory slot contains the object's time interval t, semantic embedding Qc, object center Qp, velocity v, and ego pose matrix E. In our experiments we choose K=256, which saves nearly a hundred times the storage compared with BEV-based methods.
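To make the layout concrete, below is a minimal PyTorch sketch of such a FIFO object-level memory. The class and attribute names are illustrative (they are not taken from the official StreamPETR repository); only the stored quantities follow the description above.

```python
import torch
from collections import deque


class ObjectMemoryQueue:
    """Minimal sketch of an N-frame x K-object FIFO memory (names illustrative)."""

    def __init__(self, num_frames=4, topk=256):
        self.queue = deque(maxlen=num_frames)  # FIFO: the oldest frame is dropped
        self.topk = topk

    def push(self, scores, embeds, centers, velocities, ego_pose, timestamp):
        # scores: (Q,), embeds: (Q, C), centers: (Q, 3), velocities: (Q, 2)
        # Keep only the top-K foreground queries, selected by classification score.
        idx = scores.topk(min(self.topk, scores.shape[0])).indices
        self.queue.append({
            "t": timestamp,                # time stamp of this frame
            "query": embeds[idx],          # semantic embedding Q_c
            "center": centers[idx],        # object center Q_p
            "velocity": velocities[idx],   # estimated velocity v
            "ego_pose": ego_pose,          # ego pose matrix E (4x4)
        })

    def gather(self):
        # Stack the stored frames into (N, K, ...) tensors for temporal attention.
        return {k: torch.stack([f[k] for f in self.queue])
                for k in ("query", "center", "velocity")}
```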

Propagation Transformer: in the propagation transformer we introduce two components, Motion-aware Layer Normalization (MLN) and hybrid attention. MLN performs implicit motion compensation, and hybrid attention performs RNN-style temporal interaction.


MLN

The design of MLN stems from our earlier research on GANs, where similar conditional normalization is widely used in generative models. We therefore adopt the same network structure and make the affine transformation parameters of the normalization layer dynamic.
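Below is a minimal sketch of that conditional-normalization idea, assuming a plain LayerNorm whose scale and shift are predicted from a motion vector (e.g. the flattened ego-pose matrix, the time offset Δt, and the velocity v) by small MLPs. Module and dimension names are illustrative rather than copied from the official code.

```python
import torch
import torch.nn as nn


class MotionAwareLayerNorm(nn.Module):
    """Sketch of MLN: LayerNorm whose affine parameters are motion-conditioned."""

    def __init__(self, embed_dims=256, motion_dims=18):
        super().__init__()
        # LayerNorm without its own (static) affine parameters.
        self.norm = nn.LayerNorm(embed_dims, elementwise_affine=False)
        # Small MLPs predict a per-query scale and shift from the motion vector.
        self.to_scale = nn.Sequential(
            nn.Linear(motion_dims, embed_dims), nn.ReLU(),
            nn.Linear(embed_dims, embed_dims))
        self.to_shift = nn.Sequential(
            nn.Linear(motion_dims, embed_dims), nn.ReLU(),
            nn.Linear(embed_dims, embed_dims))

    def forward(self, query, motion):
        # query:  (B, Q, C) object queries from the memory queue
        # motion: (B, Q, motion_dims), e.g. concat(flattened ego pose E, Δt, v)
        gamma = self.to_scale(motion)
        beta = self.to_shift(motion)
        return self.norm(query) * (1.0 + gamma) + beta  # dynamic affine transform
```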

Hybrid Attention

Hybrid attention is used here to replace the vanilla self-attention. First, it plays the role of self-attention, suppressing duplicate predictions within the current frame. Second, the object queries of the current frame also perform a cross-attention-like interaction with the object queries of historical frames for temporal modeling.

Since the number of hybrid queries is much smaller than the number of image tokens used in cross attention, the additional computation is negligible. In addition, historical object queries are also passed to the current frame to provide better initialization (propagated queries).
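The sketch below illustrates this with a standard multi-head attention layer: the current-frame queries attend to the concatenation of themselves and the (motion-normalized) memory queries, so a single layer acts both as self-attention for duplicate suppression and as temporal cross-attention. Shapes and names are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn


class HybridAttention(nn.Module):
    """Sketch of hybrid attention over current + historical object queries."""

    def __init__(self, embed_dims=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dims, num_heads, batch_first=True)

    def forward(self, cur_query, memory_query):
        # cur_query:    (B, Q, C)   object queries of the current frame
        # memory_query: (B, N*K, C) queries gathered from the memory queue
        kv = torch.cat([cur_query, memory_query], dim=1)
        out, _ = self.attn(query=cur_query, key=kv, value=kv)
        return cur_query + out  # residual connection, as in a standard decoder layer
```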

Experiments

Detection and tracking performance on the nuScenes dataset:


First, the detection performance on the test set: although we rank second here, ours is the only method that does not use future frames. Considering the gain VideoBEV obtains from future frames (about 4% NDS), StreamPETR has in fact opened up a clear margin over the other methods. Furthermore, we have the strongest orientation prediction (mAOE); our orientation error is the lowest even compared with multi-modal methods.


Since no method on the tracking leaderboard uses future frames, the performance advantage is even more obvious there. Using the same tracking strategy as CenterPoint, StreamPETR is 8.9% AMOTA higher than the second-ranked camera-based method, and 1.5% AMOTA higher than the LiDAR-based CenterPoint.

Ablation experiments:

1) Ablation on training queue length


Increasing the number of training frames mainly addresses the inconsistency between the number of frames used in training and in testing. With 8 frames, streaming-video testing already surpasses the sliding-window method, and performance nearly saturates at 12 frames.


To explain this phenomenon, during the rebuttal we collected occlusion statistics from the nuScenes tracking annotations. If long-term temporal modeling mainly addresses object occlusion, then the number of frames for which objects stay occluded should indicate where long-term modeling saturates. The statistics show that in nuScenes most objects are occluded for fewer than 10 frames, which supports our hypothesis.

2) Ablation on MLN


Explicit motion compensation (MC) brings no performance improvement, which is a bit counter-intuitive. In fact, vanilla attention can already handle most of the dynamic modeling on its own, so explicit motion compensation contributes little. Implicitly encoding the ego pose with MLN improves mAP by 2.0% and NDS by 1.8%. Further using MLN to encode the time offset Δt and object velocity v improves both mAP and NDS by another 0.4%.

3) Effect of the memory queue on performance


mAP and NDS increase with the number of frames stored in the memory queue and begin to saturate at 2 frames (nearly 1 second). This shows that recurrent long-term temporal modeling does not require a large memory queue. The reason 1 frame is worse than 2 frames is that only K=256 object queries are stored per frame; if 512 or 900 were stored, this gap would disappear.

4) Comparison of different temporal fusion methods

We compared with the perspective-view temporal interaction used in PETRv2. StreamPETR is superior to perspective interaction in both speed and accuracy, and combining object-query interaction with perspective interaction at the same time brings no further gain.


In my understanding, the object queries obtained through the attention mechanism act like a low-rank decomposition of the multi-view image features (see "Is Attention Better Than Matrix Decomposition?"). Although only 256 object queries are stored, most of the effective information in the image features is actually preserved, so using both kinds of features at the same time brings no further improvement.

5) Scalability


We also conduct experiments on DETR3D and obtain improvements of 4.9% mAP and 6.8% NDS with little impact on inference speed. The improvement is smaller than on PETR; it is actually the local spatial attention adopted by DETR3D that limits the performance. In private experiments we used deformable attention to increase the number of spatial sampling points in DETR3D, which reaches performance equal to or slightly higher than StreamPETR.

Summary and reflection

Since PETRv1, we have been pursuing simple and efficient designs that are convenient for everyone to use. StreamPETR has clear advantages in both speed and accuracy over previous 3D detection methods. Moreover, in the open-source release we provide the same iteration-based training scheme as SOLOFusion, so the actual training time is as efficient as single-frame training. Most of the experiments in this paper can be trained on a 2080 Ti, and in theory our ViT-L large model can be trained on a 3090-class GPU without changing the batch size. StreamPETR is the end of the PETR series; we hope you like our work!

  • How to get speeds faster than those reported in the paper?

You can reduce the number of FFN channels in the transformer from 2048 to 512 and use FlashAttention for testing. Note also that the speed reported in the paper includes post-processing. Within a 50-meter range, the number of queries can be reduced to about 300.
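As a rough illustration, these tweaks would look something like the following mmdetection-style config fragment. The exact key names and nesting in the released StreamPETR configs may differ, so treat this purely as a sketch.

```python
# Hypothetical config fragment: shrink the FFN and cut the number of queries.
model = dict(
    pts_bbox_head=dict(
        num_query=300,  # reduce the number of object queries within a 50 m range
        transformer=dict(
            decoder=dict(
                transformerlayers=dict(
                    feedforward_channels=512,  # reduce the FFN width from 2048 to 512
                    # for latency tests, also swap in a FlashAttention-based
                    # attention implementation (FP16 only, see the FAQ below)
                ),
            ),
        ),
    ),
)
```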

  • How to improve mAOE?

The transformer itself can reduce the orientation error, and query denoising also contributes to mAOE. We did not tune any parameters specifically for this; ViT + DN simply turned out to work particularly well together.

  • What are the precautions when using FlashAttention?

FlashAttention requires FP16 training and does not support the V100. On PETRv1 it fails to converge when combined with the multi-view 2D PE. FlashAttention greatly reduces GPU memory, so we also recommend it when training other PETR-based frameworks such as PFTrack.

  • What are the precautions for streaming training?

Streaming training does not converge as easily as sliding-window training, so it is best to add auxiliary supervision, for example 2D supervision, depth supervision, or query denoising.

  • What are the possible difficulties in the actual deployment of StreamPETR?

Training is generally carried out on key frames (about 2 Hz), but non-key frames (10-25 Hz) need to be handled in actual deployment. Since StreamPETR uses a temporal PE, whose interpolation and extrapolation ability is poor, this train/test inconsistency leads to a large performance drop. You can consider changing the size of the memory queue from N×K to 1×K (for example, the 4×256 setting used in the original paper can be changed to 1×900, with the number of propagated queries unchanged at 256), which has no major impact on performance. Alternatively, introducing temporal augmentation with random time intervals can also alleviate the problem (refer to how BEVDet4D or PETRv2 use sweeps).
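A tiny sketch of the suggested shape change (the numbers 4, 256, and 900 come from the discussion above; the variable names are only illustrative):

```python
import torch

embed_dims = 256
# Paper setting: a FIFO memory of N=4 frames, each holding K=256 object queries.
memory_paper = torch.zeros(4, 256, embed_dims)

# Deployment-friendly alternative: collapse the temporal axis into a single
# buffer of 900 queries, while the number of queries propagated to initialize
# the next frame stays at 256.
memory_deploy = torch.zeros(1, 900, embed_dims)
num_propagated = 256
```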

References

DETR: End-to-End Object Detection with Transformers
PETR: Position Embedding Transformation for Multi-View 3D Object Detection
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion
PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images
SOLOFusion: Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection
 
  
