BEV Column (2): Understanding the BEV Pipeline through BEVFormer (Part 2)

This article continues from the previous installment, in which we introduced BEVFormer, a state-of-the-art BEV algorithm. Here we delve into the implementation details of BEVFormer to help readers better understand how it works and why it performs well.


Algorithm Details

Input to BEVFormer

The inputs to BEVFormer are multi-view camera images and historical BEV features. Specifically, BEVFormer observes the 3D scene through multiple surround-view cameras and converts these camera views into a unified Bird's-Eye-View (BEV) representation.

In addition, BEVFormer uses historical BEV features to capture how objects in the 3D scene evolve over time.
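To make the two inputs concrete, here is a minimal sketch of the tensor shapes involved, assuming a six-camera nuScenes-style rig and a 200x200 BEV grid; all names and sizes are illustrative, not BEVFormer's actual API:

```python
import torch

# Illustrative shapes for the two inputs of a BEVFormer-style model
# (hypothetical names and sizes, not the official BEVFormer API).
batch_size, num_cams = 1, 6        # six surround-view cameras, nuScenes-style
img_h, img_w = 900, 1600           # raw camera resolution
bev_h, bev_w, embed_dim = 200, 200, 256

# Input 1: multi-view camera images for the current frame, (B, N_cam, 3, H, W).
multi_view_imgs = torch.randn(batch_size, num_cams, 3, img_h, img_w)

# Input 2: BEV features kept from the previous frame, (B, H_bev * W_bev, C).
prev_bev = torch.randn(batch_size, bev_h * bev_w, embed_dim)

print(multi_view_imgs.shape, prev_bev.shape)
```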

Encoder for BEVFormer

BEVFormer uses an encoder based on a Transformer and a temporal structure to aggregate spatio-temporal information from the multi-view cameras and the historical BEV features.

The encoder consists of two parts: a "Multi-View Encoder" for processing multi-view camera input, and a "Temporal Encoder" for processing historical BEV feature input.

Multi-View Encoder: the Multi-View Encoder uses a Transformer with a spatial attention mechanism to aggregate spatial information from the multi-view cameras.

The Multi-View Encoder first extracts a feature map from each camera view with a shared image backbone and takes these per-view features as input.

The Multi-View Encoder then uses a Transformer with a spatial attention mechanism to fuse these per-view features and generate a representation that incorporates the information from all camera views.
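Before any fusion happens, each camera view first passes through a shared 2D backbone. The sketch below illustrates that step with torchvision's ResNet-50 on small dummy images; the real BEVFormer pairs a ResNet-style backbone with an FPN, so treat the exact setup here as an assumption:

```python
import torch
import torchvision

# Shared 2D backbone applied to every camera view (illustrative setup).
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc

imgs = torch.randn(1, 6, 3, 224, 224)   # (B, N_cam, 3, H, W), small dummy images
b, n = imgs.shape[:2]
flat = imgs.flatten(0, 1)               # merge batch and camera dims: (B*N_cam, 3, H, W)
feats = feature_extractor(flat)         # (B*N_cam, 2048, H/32, W/32)
feats = feats.unflatten(0, (b, n))      # restore camera dim: (B, N_cam, 2048, h, w)
print(feats.shape)                      # torch.Size([1, 6, 2048, 7, 7])
```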

Temporal Encoder: the Temporal Encoder uses a Transformer with a temporal attention mechanism to aggregate the historical BEV features.

The Temporal Encoder takes the historical BEV features as input and uses a Transformer with a temporal attention mechanism to combine them into a representation that includes all historical information.
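Putting the two parts together, one encoder layer might chain temporal attention, spatial attention, and a feed-forward network roughly as follows. This is a simplified sketch: standard multi-head attention stands in for the paper's deformable attention, and all module names are illustrative:

```python
import torch
import torch.nn as nn

class BEVFormerLayerSketch(nn.Module):
    """Simplified BEVFormer-style encoder layer: temporal attention over
    historical BEV features, then spatial attention over image features,
    then a feed-forward network (standard attention stands in for the
    paper's deformable attention)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, bev_query, prev_bev, img_feats):
        # Temporal step: BEV queries attend to the previous frame's BEV features.
        q = self.norm1(bev_query + self.temporal_attn(bev_query, prev_bev, prev_bev)[0])
        # Spatial step: BEV queries attend to the flattened multi-view image features.
        q = self.norm2(q + self.spatial_attn(q, img_feats, img_feats)[0])
        # Feed-forward refinement.
        return self.norm3(q + self.ffn(q))

# Tiny demo with a small 20x20 BEV grid so the dense attention fits in memory.
layer = BEVFormerLayerSketch()
bev_q = torch.randn(1, 20 * 20, 256)
out = layer(bev_q, prev_bev=torch.randn(1, 20 * 20, 256),
            img_feats=torch.randn(1, 6 * 7 * 7, 256))
print(out.shape)  # torch.Size([1, 400, 256])
```

In the full model, several such layers are stacked, with the output BEV features of one layer serving as the queries of the next.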

BEVFormer Queries

After the encoder generates these representations, BEVFormer uses queries to look up locations in spatial/temporal space and fuse the corresponding spatiotemporal information.

This method uses a predefined, grid-shaped set of BEV queries to interact with the spatial and temporal features, looking up and fusing spatiotemporal information. It also helps BEVFormer effectively capture the relationships between objects in the 3D scene and thereby achieve a stronger representation.
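A minimal sketch of such grid-shaped BEV queries, implemented as learnable embeddings (the names and the exact organization are illustrative, not the official code):

```python
import torch
import torch.nn as nn

bev_h, bev_w, embed_dim = 200, 200, 256

# One learnable query vector per BEV grid cell, plus a learnable
# positional embedding encoding where the cell sits in the grid.
bev_queries = nn.Embedding(bev_h * bev_w, embed_dim)
bev_pos = nn.Embedding(bev_h * bev_w, embed_dim)

query = bev_queries.weight + bev_pos.weight   # (H_bev * W_bev, C)
query = query.unsqueeze(0)                    # add batch dim: (1, 40000, 256)
print(query.shape)
```

Each query is responsible for one cell of the BEV plane, so the output of the encoder can be reshaped back into a 2D feature map over the ground plane.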

BEVFormer's Attention Mechanism

In addition to using a Transformer and a temporal structure to aggregate spatio-temporal information, BEVFormer also uses dedicated attention mechanisms to further improve the overall performance of the system.

There are two such attention mechanisms: one operates across camera views and is called Spatial Cross-Attention; the other operates across historical BEV features and is called Temporal Self-Attention.

Spatial Cross-Attention handles the interaction between the BEV queries and the camera views. Each BEV query projects a set of 3D reference points into the camera images and aggregates features from the views in which those points are visible, producing a representation that contains information from all camera views.
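The core geometric step is projecting each BEV query's 3D reference points into the camera images to decide which views can see it. Below is a simplified sketch of that hit test; the function name and calling convention are assumptions, not BEVFormer's actual code:

```python
import torch

def project_bev_points(ref_points_3d, lidar2img, img_h, img_w):
    """Project 3D reference points (ego/LiDAR coordinates, metres) into
    each camera's image plane and report which projections land inside
    the image with positive depth.

    ref_points_3d: (N_pts, 3) xyz
    lidar2img:     (N_cam, 4, 4) per-camera projection matrices
    returns: pixel coords (N_cam, N_pts, 2) and a boolean visibility mask
    """
    n_pts = ref_points_3d.shape[0]
    homo = torch.cat([ref_points_3d, ref_points_3d.new_ones(n_pts, 1)], dim=-1)
    cam = torch.einsum('cij,nj->cni', lidar2img, homo)  # (N_cam, N_pts, 4)
    eps = 1e-5
    depth = cam[..., 2:3]
    uv = cam[..., :2] / depth.clamp(min=eps)            # perspective divide
    visible = (depth.squeeze(-1) > eps) \
        & (uv[..., 0] >= 0) & (uv[..., 0] < img_w) \
        & (uv[..., 1] >= 0) & (uv[..., 1] < img_h)
    return uv, visible

# Demo with placeholder identity matrices standing in for real calibration.
uv, visible = project_bev_points(torch.tensor([[10.0, 5.0, -1.0]]),
                                 torch.eye(4).expand(6, 4, 4), 900, 1600)
print(uv.shape, visible.shape)  # (6, 1, 2) (6, 1)
```

Only the views where the mask is true contribute; the BEV query then samples image features around the projected pixel locations, which keeps the attention sparse and cheap.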

Temporal Self-Attention handles the interaction with the historical BEV features. The previous frame's BEV features are first aligned to the current frame to compensate for ego motion; the current BEV queries then attend to the aligned history, producing a representation that incorporates all historical information.
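A simplified sketch of the alignment step: the previous BEV feature map is warped by the ego pose change before attention. Here the warp is reduced to a 2D rigid transform applied with grid_sample; the function name, argument conventions, and sign choices are assumptions that depend on how the BEV coordinates are defined:

```python
import math
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev, delta_x, delta_y, delta_yaw, bev_range=51.2):
    """Warp the previous frame's BEV features into the current ego frame.

    prev_bev: (B, C, H, W) BEV feature map from the previous frame
    delta_x, delta_y: ego translation in metres; delta_yaw in radians
    bev_range: half-size of the BEV grid in metres (assumed value)
    """
    b = prev_bev.shape[0]
    cos, sin = math.cos(delta_yaw), math.sin(delta_yaw)
    # Metres -> normalised [-1, 1] grid units (sign conventions depend on
    # how the BEV axes are defined; treat this as illustrative).
    tx, ty = delta_x / bev_range, delta_y / bev_range
    theta = torch.tensor([[cos, -sin, tx],
                          [sin,  cos, ty]], dtype=prev_bev.dtype)
    theta = theta.unsqueeze(0).repeat(b, 1, 1)           # (B, 2, 3)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)

aligned = align_prev_bev(torch.randn(1, 256, 200, 200),
                         delta_x=1.2, delta_y=0.0, delta_yaw=0.05)
print(aligned.shape)  # torch.Size([1, 256, 200, 200])
```

After alignment, each BEV grid cell attends to the history of the physical location it now covers, which is what lets the model reason about object motion.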

Output from BEVFormer

After the encoder and queries produce the final representations, BEVFormer feeds them into an output layer for processing. Specifically, the output layer can be customized for different tasks.

For example, in the 3D object detection task, the output layer converts the representations into 3D bounding boxes and class probability distributions; in the map segmentation task, it converts the representations into segmentation results.
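As a rough illustration, the sketch below puts two hypothetical heads on top of the shared representation: per-query classification and box regression for detection, and a per-cell convolution for map segmentation. The real BEVFormer uses a Deformable-DETR-style detection head and a Panoptic SegFormer-style segmentation head, so the sizes and encodings here are assumptions:

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 256, 10
box_dim = 9  # e.g. x, y, z, w, l, h, yaw, vx, vy (illustrative encoding)

det_cls_head = nn.Linear(embed_dim, num_classes)    # class logits per object query
det_box_head = nn.Linear(embed_dim, box_dim)        # 3D box parameters per object query
seg_head = nn.Conv2d(embed_dim, 2, kernel_size=1)   # per-BEV-cell segmentation logits

bev_map = torch.randn(1, embed_dim, 200, 200)       # shared BEV feature map
object_queries = torch.randn(1, 900, embed_dim)     # decoder outputs (900 queries assumed)

cls_logits = det_cls_head(object_queries)           # (1, 900, 10)
boxes = det_box_head(object_queries)                # (1, 900, 9)
seg_logits = seg_head(bev_map)                      # (1, 2, 200, 200)
print(cls_logits.shape, boxes.shape, seg_logits.shape)
```

Because both heads consume the same BEV features, adding a new perception task mainly means adding a new head rather than a new encoder.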

It can be seen that BEVFormer is a Bird's-Eye-View (BEV) encoder based on Transformer and temporal structure, which can efficiently aggregate spatio-temporal features from multi-view cameras and historical BEV features, and generate more powerful representations.

BEV features generated from BEVFormer can be used simultaneously for multiple 3D perception tasks, such as 3D object detection and map segmentation.

The experimental results in the previous section show that, when evaluated on the nuScenes dataset, BEVFormer achieves better performance in 3D object detection and map segmentation than other existing methods.

Experimental Results

Experimental results of the BEVFormer model on the nuScenes dataset demonstrate its effectiveness. With all other conditions equal, BEVFormer with temporal features improves the NDS metric by more than 7 points over BEVFormer-S, which does not use temporal features.

In particular, once temporal information is introduced, purely vision-based models can genuinely predict the moving speed of objects, which is of great significance for autonomous driving.

We also showed that performing the detection and segmentation tasks at the same time improves the performance of 3D detection. The significance of multi-task learning on the same BEV features is not only higher training and inference efficiency, but also that the perception results of the different tasks are more consistent and less prone to divergence.


Conclusion

All in all, this paper proposes BEVFormer, an algorithm that performs perception tasks with pure vision. BEVFormer extracts the image features collected by the surround-view cameras and transforms them into BEV space through learning: the model learns how to convert features from the image coordinate system to the BEV coordinate system. On this basis it performs 3D object detection and map segmentation, achieving SOTA performance.
