What's next for BEV perception in autonomous driving?

Link: https://www.zhihu.com/question/538920658

Editor: Deep Learning and Computer Vision

Disclaimer: For academic sharing only; if any content infringes your rights, please contact us and it will be removed.

At present, BEV perception seems to have become mainstream on the nuScenes/Waymo leaderboards, with methods such as the camera-only BEVFormer and the sensor-fusion TransFusion. So are there still gaps for BEV perception to fill? Are there common problems with current BEV perception? Where do the concerns of academia and industry lie, how do they differ, and what needs to be solved next?

Author: Afan Afan https://www.zhihu.com/question/538920658/answer/2647885390

A few days ago, I was fortunate to attend the VALSE 2022 conference in Tianjin, where I listened to Prof. Dai Jifeng of Tsinghua University introduce the BEVFormer work. A brief summary of the report follows: as the application scenarios of intelligent driving keep expanding, the requirements on how accurately the system represents its environment become ever higher.

A qualified intelligent driving system needs to accurately represent the surrounding environment, including the road layout, lane structure, road users, and other elements. However, object distance and scene depth cannot be effectively conveyed by 2D perception results, and this information is key for an intelligent driving system to judge its surroundings correctly. Therefore, 3D scene perception is the first choice for visual perception in intelligent driving. Recently, bird's-eye-view perception (BEV perception) for 3D object detection based on multi-view cameras has attracted more and more attention.

On the one hand, unifying the representation of different views in BEV is a natural description that is convenient for subsequent planning and control tasks; on the other hand, objects in BEV do not suffer from the scale and occlusion problems of the image view. How to elegantly obtain a good set of BEV feature descriptions is key to improving detection performance. We propose BEVFormer, a new framework for surround-view perception that uses a spatio-temporal attention mechanism to learn environmental representations in bird's-eye view and support multiple autonomous driving tasks.

Overall, BEVFormer obtains spatio-temporal information by letting predefined, rasterized bird's-eye-view queries interact with temporal and spatial features. To aggregate spatial information, a spatial cross-attention mechanism is designed in which each bird's-eye-view query extracts spatial features from the relevant regions of the camera views. For temporal information, a temporal self-attention mechanism is proposed to obtain the required temporal features from historical bird's-eye-view features. BEVFormer achieves 56.9% NDS on the nuScenes dataset, 9.0 points NDS higher than the previous best result.

Speaker introduction: Dr. Jifeng Dai received his bachelor's and doctoral degrees from the Department of Automation, Tsinghua University, in 2009 and 2014 respectively. From 2012 to 2013 he was a visiting scholar at UCLA. From 2014 to 2019 he worked in the Vision Group of Microsoft Research Asia (MSRA), where he served as a principal researcher and research manager. From 2019 to 2022 he worked at SenseTime Research, serving as executive research director and head of two departments: fundamental vision and general intelligence.

His research interests are general object recognition algorithms and cross-modal general perception algorithms in computer vision. He has published more than 30 papers in top conferences and journals in the field, with more than 20,000 citations according to Google Scholar. In 2015 and 2016 he won first place in the authoritative COCO object recognition challenge, and subsequent champion systems also used his deformable convolution module. During his time at SenseTime, he served as the technical director of the Honda-SenseTime autonomous driving R&D project. He is an editorial board member of IJCV, an area chair of CVPR 2021 and ECCV 2020, a public affairs chair of ICCV 2019, a senior PC member of AAAI 2018, and a Young Scientist of the Beijing Academy of Artificial Intelligence (BAAI).

In this talk, Prof. Dai Jifeng's sharing was mainly based on their BEVFormer work.

Existing image-view-based methods: existing perception schemes fuse the output results of different networks and rely on a large number of rules and priors.

Introduction to BEV perception and its difficulties: BEV is a form of view representation in the context of multi-sensor fusion; it is a new trend for optimal feature representation because it does not need to deal with scale and occlusion.

Difficulties in BEV perception: depth estimation after the view change, acquisition of ground-truth data, fusion of features from different sensors, and solutions that do not depend on camera parameters (domain adaptation).

Transformer-based solution: project the features of different images to the corresponding positions in the BEV view according to learned weights, then use the transformer to query spatial and temporal information separately; this method also supports other perception tasks, such as segmentation.

Experimental results: the method ranked first in the Waymo challenge with an absolute advantage, far ahead of second place.

Author: Ball Lightning
https://www.zhihu.com/question/538920658/answer/2942628096

At the moment, BEVFormer-style methods and voxel pooling split the field roughly evenly, but the perception range of these dense methods is limited by compute.

The next step should be based on sparse representations, whose computational complexity is not affected by the perception range at all. The key question is how to do temporal modeling under a sparse representation, because with dense features it is easy to spatially align two consecutive frames.

The current dense methods

Voxel pooling is dense not only in the H and W dimensions but also in depth, which wastes computation; the advantage is that each 2D feature is placed into all depth bins, so in theory the features differ across depths and depth estimation can be more accurate.
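
As a rough illustration of the "dense in depth" point, here is a minimal sketch in the spirit of lift-splat-style voxel pooling (not any specific codebase; all shapes are made up): every pixel feature is weighted into every depth bin before being splatted into BEV voxels.

```python
import torch

# Hypothetical shapes: one camera, C feature channels, D depth bins, HxW feature map.
B, C, D, H, W = 1, 64, 59, 16, 44

feat = torch.randn(B, C, H, W)            # 2D image features
depth_logits = torch.randn(B, D, H, W)    # per-pixel depth distribution (before softmax)

# "Lift": every pixel feature is copied into all D depth bins,
# weighted by its predicted depth probability.
depth_prob = depth_logits.softmax(dim=1)                  # (B, D, H, W)
frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(2)     # (B, C, D, H, W)

# This frustum is dense in depth as well as in H and W: D*H*W weighted copies
# of the features are produced before they are pooled into BEV voxels,
# which is where the wasted computation mentioned above comes from.
print(frustum.shape)  # torch.Size([1, 64, 59, 16, 44])
```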

BEVFormer is just the opposite: it projects 3D to 2D. At long range, different 3D points on the same ray are projected to the same 2D point, so depth estimation is often inaccurate in the distance, and the model often predicts a string of duplicate objects along the ray.
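
A toy example of that ambiguity, with hypothetical pinhole intrinsics: 3D points at different depths along one camera ray land on the same pixel, so sampling a 2D feature at that pixel alone cannot tell the depths apart.

```python
import torch

# Hypothetical pinhole intrinsics (fx, fy, cx, cy).
K = torch.tensor([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0,   0.0,   1.0]])

# Three 3D points on the same camera ray, at 10 m, 30 m and 60 m.
ray_dir = torch.tensor([0.2, -0.1, 1.0])
points_3d = torch.stack([d * ray_dir for d in (10.0, 30.0, 60.0)])  # (3, 3)

# Pinhole projection: u = fx*X/Z + cx, v = fy*Y/Z + cy.
uvw = points_3d @ K.T
uv = uvw[:, :2] / uvw[:, 2:3]
print(uv)  # all three rows are the same pixel -> the same sampled 2D feature
```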

About representation

In ordinary 2D detection, from anchor-based to point-based (CenterNet) and then to the DETR series, the deep-learning essence has not changed; what changed is how we express the problem: the representation.

The same is true for BEV. From per-camera 3D detection to the network directly outputting BEV detections, the only change is the representation.

Object representations are evolving in a more streamlined and elegant direction, and a representation better than BEV is also the direction of development beyond BEV.

Therefore, the value of BEV methods lies in the innovation of the representation, and they still have the problems one would expect, such as inaccurate depth estimation and the corner cases brought by long-tail distributions.

From the point of view of information theory

A communication system consists of a source, an encoder, a channel, a decoder, and a sink. For a neural network, the source is the network input, the encoder and decoder are both network structures, and the sink is the output; the channel can be considered absent, or an identity transmission channel.

Innovation in any of these modules can become the next development direction of BEV.

For the same model, fusing inputs from multiple sensors such as camera and lidar works better than camera input alone, because more information is provided at the source.

With the same input, different encodings yield different results. For example, BEVFormer and voxel pooling mentioned above are different encoding methods; there are also MLP- and IPM-based view transforms, which are not very effective.

BEVFormer v2 can be seen as improving the decoding part: it adds a mono3D head whose predictions serve as part of the decoder's reference points.
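
A rough sketch of that idea (hypothetical shapes and ranges, not BEVFormer v2's actual code): first-stage 3D proposal centers are normalized into the BEV range and concatenated with the learned reference points used by the decoder.

```python
import torch

bev_range = (-51.2, 51.2)  # hypothetical perception range in meters (x and y)

def proposals_to_reference_points(centers_xyz: torch.Tensor) -> torch.Tensor:
    """Normalize 3D proposal centers (ego frame, meters) to [0, 1] BEV coordinates."""
    lo, hi = bev_range
    ref_xy = (centers_xyz[:, :2] - lo) / (hi - lo)
    return ref_xy.clamp(0.0, 1.0)

learned_refs = torch.rand(900, 2)                  # randomly initialized query reference points
mono3d_centers = torch.tensor([[12.5, -3.0, 0.9],  # fake first-stage mono3D detections
                               [30.0, 18.2, 1.1]])
hybrid_refs = torch.cat([learned_refs, proposals_to_reference_points(mono3d_centers)], dim=0)
print(hybrid_refs.shape)  # torch.Size([902, 2])
```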

How to do sparse

The general idea: initialize a set of reference points (say, 900) in the BEV coordinate system, project them to 2D points, and sample the 2D features there. For the temporal part, use the ego pose to transform this set of reference points into frame t-1 and sample the 2D features of t-1.
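
A minimal sketch of this general idea, with hypothetical camera matrices, ego poses, and shapes; `grid_sample` stands in for whatever sampling operator one would actually use.

```python
import torch
import torch.nn.functional as F

def sample_image_feature(ref_points_3d, cam_intrinsic, cam_extrinsic, feat_map, img_hw):
    """Project 3D reference points (ego frame) into one camera and bilinearly sample its feature map."""
    N = ref_points_3d.shape[0]
    pts_h = torch.cat([ref_points_3d, torch.ones(N, 1)], dim=1)          # homogeneous (N, 4)
    pts_cam = (cam_extrinsic @ pts_h.T).T[:, :3]                          # ego -> camera frame
    uvw = (cam_intrinsic @ pts_cam.T).T                                   # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)                         # pixel coordinates
    H, W = img_hw
    grid = torch.stack([uv[:, 0] / W, uv[:, 1] / H], dim=1) * 2 - 1       # normalize to [-1, 1]
    grid = grid.view(1, N, 1, 2)
    return F.grid_sample(feat_map, grid, align_corners=False).squeeze(-1) # (1, C, N)

def warp_to_prev_frame(ref_points_3d, ego_pose_t, ego_pose_t_minus_1):
    """Move the same reference points into frame t-1 using the two ego poses (4x4 ego-to-world)."""
    pts_h = torch.cat([ref_points_3d, torch.ones(ref_points_3d.shape[0], 1)], dim=1)
    world = ego_pose_t @ pts_h.T                                          # ego(t) -> world
    return (torch.linalg.inv(ego_pose_t_minus_1) @ world).T[:, :3]        # world -> ego(t-1)

# Toy usage: 900 reference points, one camera with a 64-channel 32x88 feature map.
K = torch.tensor([[500.0, 0.0, 704.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
refs = torch.rand(900, 3) * 100 - 50
feat = torch.randn(1, 64, 32, 88)
sampled = sample_image_feature(refs, K, torch.eye(4), feat, (512, 1408))
print(sampled.shape)  # torch.Size([1, 64, 900])
```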

There is no problem in principle, but some details remain. For example, randomly initialized reference points may be hard to converge; this can be addressed by adding priors to the reference points (the BEVFormer v2 approach, the DETR4D approach, or directly using radar data; a combination of these may work even better). The temporal part also suffers from weak dependence on historical frames: if you want to use more historical frames, you need to do more sampling.

Author: Wu Yufeng
https://www.zhihu.com/question/538920658/answer/2992140428

BEVFormer++ has multiple encoder layers, each of which follows the traditional transformer structure except for three custom designs: BEV queries, spatial cross-attention, and temporal self-attention.

The BEV query is a grid-shaped learnable parameter used to query features in BEV space from multi-camera views via an attention mechanism. Spatial cross-attention and temporal self-attention are attention layers that process the BEV queries, finding and aggregating the spatial features of the multi-camera images and the temporal features of historical BEVs according to the BEV queries.


In the inference stage, at time t the multi-camera images are fed to the backbone network (e.g. ResNet101) to obtain the features F_t of the different camera views; meanwhile, the BEV features B_t-1 from the previous timestamp t-1 are kept.

In each encoder layer, BEV query Q is first used to query temporal information from previous BEV features B_t-1 through temporal self-attention.

The BEV query Q is then used to query the spatial information from the multi-camera features F_t through spatial cross-attention. After the feed-forward network, the encoder layer generates improved BEV features as the input of the next encoder layer. After six stacked encoder layers, a unified BEV feature B_t for the current timestamp t is generated.

Taking the BEV feature B_t as input, the 3D detection head and the map segmentation head predict perception results such as 3D bounding boxes and semantic maps.
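Putting the steps above together, here is a schematic, runnable stand-in for the encoder stack (not the released BEVFormer code): standard `nn.MultiheadAttention` is used purely as a placeholder for the deformable temporal self-attention and spatial cross-attention, and all sizes are toy values.

```python
import torch
import torch.nn as nn

class BEVEncoderLayerSketch(nn.Module):
    """One encoder layer: temporal self-attention over B_{t-1}, spatial
    cross-attention over multi-camera features F_t, then a feed-forward network."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, bev_query, bev_prev, cam_feats):
        # (1) temporal self-attention: BEV queries attend to the previous BEV features
        q = self.norm1(bev_query + self.temporal_attn(bev_query, bev_prev, bev_prev)[0])
        # (2) spatial cross-attention: queries attend to flattened multi-camera features
        q = self.norm2(q + self.spatial_attn(q, cam_feats, cam_feats)[0])
        # (3) feed-forward network
        return self.norm3(q + self.ffn(q))

# Toy sizes: 50x50 BEV grid, 6 cameras with 100 feature tokens each, dim 256.
bev_query = torch.randn(1, 50 * 50, 256)   # learnable grid-shaped BEV queries
bev_prev = torch.randn(1, 50 * 50, 256)    # B_{t-1}
cam_feats = torch.randn(1, 6 * 100, 256)   # F_t flattened across cameras
layers = nn.ModuleList(BEVEncoderLayerSketch() for _ in range(6))
x = bev_query
for layer in layers:                        # six stacked encoder layers -> unified B_t
    x = layer(x, bev_prev, cam_feats)
print(x.shape)  # torch.Size([1, 2500, 256])
```

The output of the last layer plays the role of the unified BEV feature B_t that the detection and segmentation heads consume.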


To improve the feature quality of BEV encoders, there are three main aspects as follows:

(1) 2D feature extractor

Techniques used to improve backbone representation quality in 2D perception tasks are very likely to also improve representation quality for BEV tasks. For convenience, the feature pyramids widely used in most 2D perception tasks are adopted in the image backbone. Structural designs of 2D feature extractors, such as state-of-the-art image feature extractors, global information interaction, and multi-level feature fusion, all help produce better representations for BEV perception. Besides the structural design, auxiliary tasks that supervise the backbone are also important to BEV perception performance.
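
As a small aside on how the 2D feature extractor is usually wired up for surround-view input, here is a sketch with a toy stand-in for the ResNet + FPN backbone (the layer sizes and input resolution are illustrative, not BEVFormer++'s actual extractor): the cameras are folded into the batch dimension so one shared backbone processes all views.

```python
import torch
import torch.nn as nn

# Toy stand-in for the image backbone + feature pyramid (e.g. ResNet101 + FPN);
# only the multi-camera shape handling matters here.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
    nn.Conv2d(64, 256, 3, stride=4, padding=1), nn.ReLU(),
)

B, N_cam, H, W = 1, 6, 256, 704            # hypothetical input resolution
images = torch.randn(B, N_cam, 3, H, W)

# Fold cameras into the batch dimension, run the shared extractor,
# then unfold back to (B, N_cam, C, h, w) per-view feature maps.
feats = backbone(images.flatten(0, 1))
feats = feats.unflatten(0, (B, N_cam))
print(feats.shape)  # torch.Size([1, 6, 256, 16, 44])
```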

(2) View transformation

This transformation takes image features and reorganizes them into BEV space. Its hyperparameters include the sampling range, the sampling frequency, and the BEV resolution, all of which are critical to BEV perception performance. The sampling range determines how much of the frustum behind the image is sampled into BEV space; by default this range equals the effective range of the lidar annotations. When efficiency has higher priority, the upper part of the frustum along the z-axis can be sacrificed, since in most cases it contains only unimportant content such as the sky. The sampling frequency determines how fully the image features are utilized: a higher frequency ensures that the model accurately samples the corresponding image feature for each BEV location, at a higher computational cost. The BEV resolution determines the granularity of the BEV feature representation, where each feature can be traced back precisely to a grid cell in the world coordinate system; high resolution is needed to represent small-scale objects such as traffic lights and pedestrians well. Many BEV perception networks also include feature extraction operations in the view transformation, such as convolutional blocks or Transformer blocks; adding better feature extraction sub-networks in BEV space can also improve BEV perception performance.
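
To make the range and resolution knobs concrete, here is a small sketch that builds the grid of BEV cell centers from a hypothetical range and resolution (the numbers are illustrative, not the paper's configuration); each BEV feature traces back to exactly one of these world-frame cells.

```python
import torch

# Hypothetical BEV configuration: +/-51.2 m range, 0.8 m resolution -> 128x128 grid.
x_range, y_range, resolution = (-51.2, 51.2), (-51.2, 51.2), 0.8

xs = torch.arange(x_range[0] + resolution / 2, x_range[1], resolution)
ys = torch.arange(y_range[0] + resolution / 2, y_range[1], resolution)
grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
bev_cell_centers = torch.stack([grid_x, grid_y], dim=-1)   # (128, 128, 2), meters in ego frame

# The BEV feature at index (i, j) corresponds to the cell centered at
# bev_cell_centers[i, j]; halving `resolution` quadruples the number of
# cells, which is the accuracy-vs-compute trade-off described above.
print(bev_cell_centers.shape)  # torch.Size([128, 128, 2])
```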

(3) Temporal BEV fusion

Given the structure of BEV features, temporal fusion in BEV space usually leverages the ego vehicle's pose information to align BEV features across time. During this alignment, the motion of other objects is not explicitly modeled and must be learned additionally by the model. To better fuse the features of other moving objects, it is reasonable to enlarge the receptive range of the cross-attention during temporal fusion, for example by enlarging the range of the sampling offsets in the deformable attention module, or by using global attention.
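
A minimal sketch of the ego-pose alignment step, assuming a purely rigid warp of B_t-1 given a relative yaw and translation; the affine-grid formulation and the values are illustrative, not the alignment module actually used in BEVFormer++.

```python
import math
import torch
import torch.nn.functional as F

def align_prev_bev(bev_prev, rel_yaw, rel_xy, bev_size_m=102.4):
    """Warp B_{t-1} (shape (1, C, H, W), in the ego frame at t-1) into the ego
    frame at t, given the relative ego yaw (rad) and translation (m).
    Rigid warp only: the motion of other objects is not modeled, as noted above."""
    cos, sin = math.cos(rel_yaw), math.sin(rel_yaw)
    # 2x3 affine matrix in normalized grid coordinates; the translation is scaled
    # by half the BEV extent because affine_grid works in [-1, 1] coordinates.
    theta = torch.tensor([[cos, -sin, 2.0 * rel_xy[0] / bev_size_m],
                          [sin,  cos, 2.0 * rel_xy[1] / bev_size_m]]).unsqueeze(0)
    grid = F.affine_grid(theta, list(bev_prev.shape), align_corners=False)
    return F.grid_sample(bev_prev, grid, align_corners=False)

bev_prev = torch.randn(1, 256, 128, 128)   # hypothetical B_{t-1}
aligned = align_prev_bev(bev_prev, rel_yaw=0.05, rel_xy=(1.2, 0.0))
print(aligned.shape)  # torch.Size([1, 256, 128, 128])
```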

☆ END ☆

Origin: blog.csdn.net/woshicver/article/details/131714570