A 10,000-word article on BEV perception in autonomous driving

Author | Qidao Daguang Editor | Autobot

Original link: https://zhuanlan.zhihu.com/p/674854831


This article is for academic sharing only. If there is any infringement, please contact us to delete the article.

Prologue

This may turn into the longest series of articles I write, so let me first explain why. On the one hand, it takes time to see how much a large segmentation model can actually improve a small model. On the other hand, I have been working on BEV algorithms for autonomous driving for quite a while, so it is also time to sort out the preliminary research.

(Many things are interrelated and decisions have to be made: whether the large model really improves the small model; whether the cost of that improvement is worth the gain; if the improvement is small, whether to keep going or change direction; how many resources are available to support continued exploration; when the BEV pre-research can pay off; whether the papers I have read can be turned into output, and whether that output can bring me any benefit or income. These are all open questions. The situation looks complicated, but it is really just a hidden Markov model or a CRF; everything is still in flux, so I am just waiting for the other shoe to drop.)

In fact, "BEV algorithm for autonomous driving" is a very broad term. BEV stands for bird's-eye view, the paradigm of performing perception tasks from a top-down view. From the data perspective, there are camera-only BEV, LiDAR-only BEV, and camera-LiDAR (or other multi-sensor) fusion BEV.

There is a lot to unpack here. For a company project, the technical route, the choice of sensors, and the platforms the system must run on all impose requirements on the algorithm. Under all the compromises of reality, the most sophisticated algorithm is not necessarily the best choice, and the algorithm with the highest metrics is not necessarily usable.

Let me go through things briefly first, and also share part of my paper list with you:

Camera BEV:

1. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

  • Comments: The originator of pure-vision BEV. In later work, whenever a 2D image feature needs to be lifted into 3D space and then flattened into BEV space, the lift-splat operation proposed here is used. The code is also simple and easy to read. (The shoot part is meant for planning; that code was never released.) The core idea is to lift each image separately into a frustum of features, then splat all frustums onto a rasterized BEV grid, as shown in the figure below.

[Figure: each image is lifted into a frustum of features, then all frustums are splatted onto the rasterized BEV grid]

Of course, since it is the first-generation version, some problems are inevitable. The two most important ones: the splat step is time-consuming, and the depth estimation has no ground-truth supervision when assigning a possible depth to each point, so accuracy is insufficient and performance degrades the farther away you look. Compared with some of the newer algorithms, the gap is large, but that does not diminish its importance among BEV algorithms.
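To make the lift-splat idea concrete, here is a minimal PyTorch sketch (my own simplification, not the official LSS code; the `frustum_xyz` tensor, the shapes, and the plain sum-pooling are assumptions for illustration). The "lift" step is the outer product of a per-pixel depth distribution with the image features; the "splat" step scatter-adds every frustum point into its BEV cell.

```python
import torch

def lift_splat(img_feats, depth_logits, frustum_xyz, bev_shape, bev_origin, voxel_size):
    # img_feats:    (N, C, H, W)    per-camera image features
    # depth_logits: (N, D, H, W)    predicted categorical depth distribution per pixel
    # frustum_xyz:  (N, D, H, W, 3) precomputed ego-frame coordinates of every frustum point
    N, C, H, W = img_feats.shape

    # "Lift": outer product of the depth distribution and the image features
    depth_prob = depth_logits.softmax(dim=1)                            # (N, D, H, W)
    frustum_feats = depth_prob.unsqueeze(2) * img_feats.unsqueeze(1)    # (N, D, C, H, W)

    # "Splat": drop every frustum point into its BEV cell and sum-pool
    coords = ((frustum_xyz[..., :2] - torch.tensor(bev_origin)) / voxel_size).long()
    flat_feats = frustum_feats.permute(0, 1, 3, 4, 2).reshape(-1, C)    # (N*D*H*W, C)
    flat_idx = coords.reshape(-1, 2)
    keep = ((flat_idx >= 0) & (flat_idx < torch.tensor(bev_shape))).all(dim=1)
    bev = img_feats.new_zeros(bev_shape[0] * bev_shape[1], C)
    linear = flat_idx[keep, 0] * bev_shape[1] + flat_idx[keep, 1]
    bev.index_add_(0, linear, flat_feats[keep])                         # sum per BEV cell
    return bev.view(bev_shape[0], bev_shape[1], C).permute(2, 0, 1)     # (C, X, Y)
```

The scatter-add at the end is exactly the part that later papers spend so much effort accelerating.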

Off-topic: I have also optimized LSS myself. It is a long story, so I will just recommend some of my earlier related blog posts. In short, I once thought about climbing the nuScenes leaderboard, which would be good for both the company and myself, and I planned to start from LSS. The main consideration was that LSS is plug-and-play and many BEV algorithms are built on it; an improvement to such a basic module can lift every algorithm that uses the LSS module, which is a meaningful thing. I won't go into the rest; it was all sweat and exhaustion!

One is an interpretation of the code

One is a nuScenes SOTA attempt that improves LSS

One is the nuScenes depth map I made earlier

2. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird's-Eye View Representation

  • Comment: This paper is not open source. Like LSS, it is NVIDIA work, and it also projects from 2D to 3D and then into BEV space. However, because it assumes the depth distribution along each ray is uniform, every voxel along a camera ray is filled with the same feature from the corresponding 2D pixel, which makes it memory-efficient and faster. Judging from the paper alone, the results are better than LSS. It claims to be the first unified framework to perform detection and segmentation at the same time, but again, there is no open-source code.

  • One more point: the uniform-depth assumption means every voxel along a camera ray gets the same feature as the corresponding pixel P in 2D space. The benefit is that this assumption improves computational and storage efficiency by reducing the number of learned parameters. LSS is different: it learns a non-uniform depth distribution. (A toy sketch of the contrast is shown below.)
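As a toy contrast (my own illustration, not M2BEV's code): with a uniform depth distribution the pixel feature is simply broadcast along its ray with no per-depth parameters, while LSS weights each copy by a learned depth probability.

```python
import torch

def uniform_lift(img_feats, num_depth_bins):
    # img_feats: (C, H, W). Every voxel along a pixel's ray gets the SAME feature
    # (M2BEV's uniform assumption), so the lift is a cheap broadcast.
    return img_feats.unsqueeze(0).expand(num_depth_bins, *img_feats.shape)   # (D, C, H, W)

def lss_style_lift(img_feats, depth_prob):
    # depth_prob: (D, H, W) learned non-uniform depth distribution (what LSS does instead)
    return depth_prob.unsqueeze(1) * img_feats.unsqueeze(0)                  # (D, C, H, W)
```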


3. BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View

  • Comments: A successor to the LSS algorithm, or, to put it bluntly, LSS with a new skin: the core is kept and the pre- and post-processing are improved. LSS was originally used for a segmentation task, while this paper applies it to detection. The work comes from PhiGent Robotics, and a whole series of follow-up papers and improved versions came later. Also, starting from here, many later papers optimize and accelerate the so-called summation (the splat pooling) part of LSS.

Off-topic (+1): The founders of PhiGent Robotics are all from Tsinghua. We met in the second half of 2021, when the company had only a few dozen people and said they would work on perception post-processing. It is a pity that I was still young at the time and we did not click. Looking back, they have indeed done a lot of meaningful work, thumbs up! That said, the current autonomous driving environment is mediocre and nobody is having an easy time.


Here’s a look at the development history of the bevdet series:

[Figure: development timeline of the BEVDet series]

After BEVDet, BEVDet4D (BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection) mainly adds the temporal dimension, fusing features of a past frame with the current frame to improve accuracy, which is quite interesting. The main points: for feature fusion, BEVDet4D keeps the intermediate BEV features of past frames and fuses them with the current frame by aligning and concatenating, so it obtains temporal cues by querying two candidate features at only a negligible increase in computation. To simplify the task, BEVDet4D removes ego-motion and the time factor from the velocity prediction target, which reduces velocity error and makes vision-based methods comparable, for the first time, with those relying on LiDAR or radar. The performance gain of BEVDet4D comes mainly from more accurate velocity estimation. In addition, mAP on the small model improves somewhat, mainly because the small input resolution cannot cover the 50-meter range and historical frames provide extra cues. A sketch of the alignment step follows.
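A minimal sketch of the align-and-concatenate step (my own simplification; BEVDet4D's actual implementation differs in details such as interpolation and feature layout, and the axis convention here is an assumption): the previous BEV feature is warped into the current ego frame using the relative ego pose, then concatenated channel-wise with the current BEV feature.

```python
import torch
import torch.nn.functional as F

def align_and_concat(curr_bev, prev_bev, T_curr_from_prev, bev_origin, voxel_size):
    # curr_bev, prev_bev: (1, C, H, W); x is assumed to lie along W and y along H
    # T_curr_from_prev:   (4, 4) ego transform taking previous-frame coords to the current frame
    _, C, H, W = prev_bev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x = bev_origin[0] + (xs.float() + 0.5) * voxel_size     # cell centers in the current ego frame
    y = bev_origin[1] + (ys.float() + 0.5) * voxel_size
    pts = torch.stack([x, y, torch.zeros_like(x), torch.ones_like(x)], dim=-1).view(-1, 4)
    # Where each current cell center was located in the previous ego frame
    prev_pts = pts @ torch.linalg.inv(T_curr_from_prev).T
    gx = (prev_pts[:, 0] - bev_origin[0]) / (W * voxel_size) * 2 - 1   # normalize for grid_sample
    gy = (prev_pts[:, 1] - bev_origin[1]) / (H * voxel_size) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(1, H, W, 2)
    aligned_prev = F.grid_sample(prev_bev, grid, align_corners=False)  # warped past BEV feature
    return torch.cat([curr_bev, aligned_prev], dim=1)                  # (1, 2C, H, W)
```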

Later, BEVPoolV2 mainly focuses on speeding up lift-splat. It uses precomputation to optimize the calculation, achieving a 4.6x to 15.1x speedup while also reducing memory usage. This engineering is genuinely impressive, thumbs up! It really matters for deployment. The core idea is sketched below.
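The real speedup comes from a fused CUDA kernel that never materializes the huge (N, D, H, W, C) frustum tensor, but the precomputation idea behind it can be sketched in plain PyTorch (names and shapes are my own assumptions, not the BEVPoolV2 code): since the camera geometry is fixed, the frustum-to-BEV index mapping is computed once, and per frame only a gather plus an index_add_ remains.

```python
import torch

def precompute_bev_indices(frustum_xyz, bev_shape, bev_origin, voxel_size):
    # Run once (the geometry is static): flattened BEV cell index of every frustum point.
    coords = ((frustum_xyz[..., :2] - torch.tensor(bev_origin)) / voxel_size).long()
    valid = ((coords >= 0) & (coords < torch.tensor(bev_shape))).all(dim=-1).reshape(-1)
    linear = (coords[..., 0] * bev_shape[1] + coords[..., 1]).reshape(-1)
    return linear[valid], valid

def bev_pool(frustum_feats, linear_idx, valid, bev_shape):
    # Run every frame: frustum_feats is (P, C), the lifted features flattened over all points.
    C = frustum_feats.shape[1]
    bev = frustum_feats.new_zeros(bev_shape[0] * bev_shape[1], C)
    bev.index_add_(0, linear_idx, frustum_feats[valid])
    return bev.view(bev_shape[0], bev_shape[1], C)
```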


As of December 22, 2023, the latest development in the BEVDet series is DAL (Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection), which fuses vision and LiDAR with improved accuracy and speed. Put bluntly, the pipeline is similar to BEVFusion (MIT version), except that the fusion branch trusts LiDAR more, wrapped in a story of "imitating the data annotation process". That is the somewhat superficial summary; it does not change the fact that the results are good and a lot of detailed work was done. I am looking forward to their next work.

The following is the framework diagram from the DAL paper:

[Figure: framework diagram from the DAL paper]

4. BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

  • Comments: Uses transformers and temporal information for BEV perception tasks (detection and segmentation) to improve performance. It introduces designs such as BEV queries, spatial cross-attention, and temporal self-attention. The main thing to note is that these two attentions are not classic transformer attention but deformable attention (DeformAttn). I remember spending some effort tracing the whole pipeline; the code is relatively complicated.
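A heavily simplified sketch of the spatial cross-attention idea (my own toy version using plain bilinear sampling at projected reference points; the real BEVFormer uses multi-scale deformable attention with learned sampling offsets and attention weights): each BEV query owns a pillar of 3D reference points, these are projected into every camera, and image features are sampled wherever the projection lands inside an image and averaged back into the query.

```python
import torch
import torch.nn.functional as F

def spatial_cross_attention(bev_queries, ref_points_3d, img_feats, lidar2img):
    # bev_queries:   (Q, C)        one feature per BEV grid cell
    # ref_points_3d: (Q, Z, 3)     a pillar of 3D reference points per query
    # img_feats:     (Ncam, C, H, W)
    # lidar2img:     (Ncam, 4, 4)  projection matrices (intrinsics @ extrinsics)
    Q, Z, _ = ref_points_3d.shape
    Ncam, C, H, W = img_feats.shape
    pts = torch.cat([ref_points_3d, torch.ones(Q, Z, 1)], dim=-1).view(-1, 4)  # homogeneous
    out = torch.zeros(Q, C)
    hits = torch.zeros(Q, 1)
    for c in range(Ncam):
        cam = pts @ lidar2img[c].T                        # project into camera c
        uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-5)     # pixel coordinates
        valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                                & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(img_feats[c:c + 1], grid, align_corners=True)   # (1, C, 1, Q*Z)
        sampled = sampled[0, :, 0].T.reshape(Q, Z, C) * valid.view(Q, Z, 1)
        out += sampled.sum(dim=1)                         # accumulate valid samples
        hits += valid.view(Q, Z).sum(dim=1, keepdim=True)
    return bev_queries + out / hits.clamp(min=1)          # residual update of the BEV queries
```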

You can also start from here and look at the development of DETR. The BEV queries here are randomly initialized. I remember that around NVIDIA's entry in the CVPR 2023 challenge there was a paper, FB-BEV or FB-OCC, that improved the initialization of the queries, so you can pick up a bit from that as well.

[Screenshot: the FB-BEV / FB-OCC paper]

The screenshot above is the source of that paper; I am not just making this up! It is not covered in detail here mainly because FB-OCC was originally aimed at occupancy grid prediction, and the FB-BEV results released later for BEV were not that amazing. Besides, leaderboard work is not aimed at productization: it adds a backward projection (3D to 2D) check on top of the forward projection (2D to 3D), which is of course slow, so I did not go into it. (There is also an element of laziness.)


There are later improved versions such as BEVFormer++ and BEVFormerV2; this follow-up work was not all done by the original team. Briefly on BEVFormerV2: the main issue is that it is not open source; the paper says the code will be released, but it is not yet available, so you can only study the idea. BEVFormer++ is a technical report from the original team, which took first place in the Waymo Open Dataset Challenge 2022; worth a look.

5. BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection

  • Comment: BEVDepth is a piece of work I appreciate very much. Built on LSS, it uses LiDAR as the depth ground truth, feeds the camera intrinsics into the depth estimation network, and adds a depth refinement module to specifically optimize the depth estimation. More importantly, the original LSS "cumsum trick" is replaced with a GPU-parallel implementation, which greatly speeds up the LSS step: about 80x faster than the original LSS on GPU. It really lifted the depth estimation of LSS to a new level. (In fact, this was also my inspiration for writing an article on LSS depth at the time; that aside is not important.) A sketch of how the LiDAR depth target can be built follows.
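A sketch of how such a sparse depth target can be built by projecting LiDAR into a camera (a naive illustration under assumed matrix conventions, not the BEVDepth repo code); the resulting per-pixel depths are then discretized into the same bins as the depth head and used as supervision.

```python
import torch

def lidar_depth_map(points_lidar, T_cam_from_lidar, K, H, W, d_max=60.0):
    # points_lidar:     (N, 3) LiDAR points
    # T_cam_from_lidar: (4, 4) extrinsics; K: (3, 3) intrinsics
    pts = torch.cat([points_lidar, torch.ones(points_lidar.shape[0], 1)], dim=1)
    pts_cam = pts @ T_cam_from_lidar.T
    z = pts_cam[:, 2]
    keep = (z > 1e-3) & (z < d_max)
    uvw = pts_cam[keep, :3] @ K.T
    u = (uvw[:, 0] / uvw[:, 2]).long()
    v = (uvw[:, 1] / uvw[:, 2]).long()
    z = z[keep]
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = torch.zeros(H, W)                      # 0 = no LiDAR return, no supervision
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi                     # keep the nearest return per pixel
    return depth
```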


6. CVT: Cross-view Transformers for real-time Map-view Semantic Segmentation

  • Comments: Projects the front view to the top view, using the transformer attention mechanism for cross-view segmentation.


7. GKT: Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer

  • Comment: I have previously annotated the core code here: https://blog.csdn.net/weixin_46214675/article/details/131169769?spm=1001.2014.3001.5501. It is not a very famous paper compared with others published at CVPR, ECCV, or ICCV. The main reason I looked into it was a project on a Horizon chip at the time, and Horizon had support for this algorithm; it later turned out to be work done during an internship at Horizon. So far I have not seen a successful production deployment either, and I think it is normal for them to support their own work. If you are interested, take a look yourself. To put it simply, each BEV cell is projected into the image and attention is restricted to a small range around the projection; there are also extras such as adding noise to make the network more robust, and building a lookup table to speed things up. A minimal sketch of the kernel-attention idea follows.
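The sketch (my own toy version for a single BEV cell and a single camera; the real GKT batches this, uses the lookup table, and handles image borders and calibration noise):

```python
import torch

def kernel_attention(bev_query, img_feat, center_uv, kernel=7):
    # bev_query: (C,)       query of one BEV grid cell
    # img_feat:  (C, H, W)  image feature map
    # center_uv: (2,)       pixel location where this BEV cell projects to
    C, H, W = img_feat.shape
    half = kernel // 2
    u, v = int(center_uv[0]), int(center_uv[1])
    u0, u1 = max(u - half, 0), min(u + half + 1, W)
    v0, v1 = max(v - half, 0), min(v + half + 1, H)
    patch = img_feat[:, v0:v1, u0:u1].reshape(C, -1)              # local keys/values, (C, k*k)
    attn = torch.softmax(bev_query @ patch / C ** 0.5, dim=-1)    # attention over the kernel only
    return patch @ attn                                           # aggregated feature, (C,)
```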

LiDAR BEV:

For LiDAR, BEV is actually not that essential; LiDAR BEV is closely tied to detection. How many paradigms are there for LiDAR-based detection? Roughly: point-based methods, such as PointNet and PointNet++; view-based methods, which can use a top view (that is BEV) or a front/range view, and both view transformations also show up in LiDAR-camera fusion; and voxel-based methods. Of course this taxonomy is not precise, and whatever the paradigm, the approaches borrow from each other.

1. PointPillars

  • Comments: https://blog.csdn.net/weixin_46214675/article/details/125927515?spm=1001.2014.3001.5502 I made a PPT about it quite a while ago, mainly for popular-science purposes, so it is not very detailed. But it is a genuinely useful paper for the industry, like it!

2. CenterPoint

  • Comment: CenterPoint uses standard LiDAR-based backbone networks, VoxelNet or PointPillars. It predicts the relative offset (velocity) of each object between consecutive frames and then greedily links detections across frames, so 3D object tracking reduces to greedy closest-point matching (a minimal sketch of that matching step follows this list). The detector-tracker combination is both efficient and effective: CenterPoint achieves state-of-the-art performance on the nuScenes benchmark and outperforms all previous single-model approaches on the Waymo Open Dataset.

  • It is also a very good paper and is still widely used in industry. Like it!
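As mentioned above, a minimal sketch of the greedy closest-point matching used for tracking (my own illustration, not the CenterPoint repo code; here candidates are simply processed in order of distance): current detections are pushed back to the previous timestamp with their predicted velocities and then greedily matched to the closest unused previous track center.

```python
import numpy as np

def greedy_track_match(curr_centers, curr_vel, prev_centers, dt, dist_thresh=2.0):
    # curr_centers: (N, 2) BEV centers of current detections
    # curr_vel:     (N, 2) predicted velocities; prev_centers: (M, 2) previous track centers
    proj = curr_centers - curr_vel * dt                 # back-project to the previous timestamp
    dists = np.linalg.norm(proj[:, None, :] - prev_centers[None, :, :], axis=-1)   # (N, M)
    matches, used = {}, set()
    for i in np.argsort(dists.min(axis=1)):             # closest candidates handled first
        d = dists[i].copy()
        d[list(used)] = np.inf                          # each previous track matched at most once
        j = int(np.argmin(d))
        if d[j] < dist_thresh:
            matches[i] = j
            used.add(j)
    return matches                                      # curr index -> prev index; the rest start new tracks
```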

Fusion BEV:

1. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [code]

  • Comment: BEVFusion is a BEV algorithm for camera-LiDAR fusion that supports detection and segmentation simultaneously, and many variants of this method appear on the nuScenes detection leaderboard. The overall architecture is very clear: the camera branch and the LiDAR branch extract features separately, both are transformed into a shared BEV space, and BEV features are then extracted for the multi-task heads; it is about as simple as fusion gets. It is worth mentioning that BEVFusion puts a lot of work into optimizing the camera-to-BEV step: the precomputation part saves 13 ms and the GPU interval-reduction optimization saves 498 ms. This optimization is pushed to the extreme, and many companies cannot match it. A skeleton of this recipe is sketched below.
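The skeleton (channel counts, head designs, and the class name are placeholders of mine, not the official BEVFusion code): both branches are assumed to already output BEV maps of the same spatial size, a small convolutional fuser mixes them, and separate heads handle detection and segmentation.

```python
import torch
import torch.nn as nn

class TinyBEVFusion(nn.Module):
    def __init__(self, cam_bev_c=80, lidar_bev_c=256, fused_c=256):
        super().__init__()
        # Concatenate the two BEV maps and fuse them with a small conv block
        self.fuser = nn.Sequential(
            nn.Conv2d(cam_bev_c + lidar_bev_c, fused_c, 3, padding=1),
            nn.BatchNorm2d(fused_c), nn.ReLU(inplace=True),
        )
        self.det_head = nn.Conv2d(fused_c, 10, 1)   # e.g. class heatmaps for detection
        self.seg_head = nn.Conv2d(fused_c, 6, 1)    # e.g. map segmentation classes

    def forward(self, cam_bev, lidar_bev):
        # cam_bev: (B, cam_bev_c, H, W); lidar_bev: (B, lidar_bev_c, H, W), same H and W
        fused = self.fuser(torch.cat([cam_bev, lidar_bev], dim=1))
        return self.det_head(fused), self.seg_head(fused)
```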


The experimental comparison section of the paper also shows the strength of BEVFusion. It really is an excellent framework for engineers who want to learn BEV. As the team says, they hope BEVFusion can serve as a simple but strong baseline and inspire future research on multi-task multi-sensor fusion. Unfortunately, despite all the speed work, it still cannot run in real time (8.4 FPS).

In actual engineering projects, the biggest problem with using BEVFusion is optimization, because the authors' optimizations are written in CUDA. Once you leave the NVIDIA platform, many parts are affected, and customized operators have to be written for each target platform.


2. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection [code]


Of course, algorithms keep being improved and updated, and new ones are often faster and stronger; CMT, for example, is also on the nuScenes leaderboard. Since we are here, let me briefly describe the essence of CMT: it really comes down to understanding the position encoding.

For the image, each pixel is first multiplied by the inverse of the intrinsics and then by the extrinsics to bring it into the LiDAR coordinate system, and an MLP then outputs its position encoding.


For the point cloud, VoxelNet or PointPillars encodes the point-cloud tokens, points are simply sampled along the height axis on the BEV feature grid, and an MLP outputs their position encoding.


The position-guided queries start from randomly generated anchor points A_i = (a_{x,i}, a_{y,i}, a_{z,i}) drawn uniformly from [0, 1], which are then scaled by the range and shifted by the minimum value into the 3D region of interest (RoI) of world space.


These coordinates are then projected into the image and point-cloud modalities to obtain the corresponding position encodings, which are added together to form the query embedding Q.


The original post shows the corresponding code as a screenshot; the key point is that these reference points are learnable parameters. A sketch of the idea follows.

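The sketch (illustrative names, a single camera, and simplified MLPs; not the CMT repo code): the reference points are a learnable tensor kept in [0, 1], scaled into the 3D RoI, projected into the image and onto the BEV plane, and two MLPs produce positional encodings that are summed into the query embedding.

```python
import torch
import torch.nn as nn

class PositionGuidedQueries(nn.Module):
    def __init__(self, num_queries=900, embed_dim=256,
                 roi=(-54.0, -54.0, -5.0, 54.0, 54.0, 3.0)):   # (x_min, y_min, z_min, x_max, y_max, z_max)
        super().__init__()
        self.ref = nn.Parameter(torch.rand(num_queries, 3))    # learnable reference points in [0, 1]
        self.register_buffer("roi", torch.tensor(roi))
        self.img_pe = nn.Sequential(nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.pts_pe = nn.Sequential(nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, lidar2img):                               # lidar2img: (4, 4) for one camera
        lo, hi = self.roi[:3], self.roi[3:]
        anchors = self.ref.clamp(0, 1) * (hi - lo) + lo         # scale into the 3D RoI
        homo = torch.cat([anchors, torch.ones_like(anchors[:, :1])], dim=-1)
        cam = homo @ lidar2img.T                                # project into the image
        uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-5)
        pe_img = self.img_pe(uv)                                # image-side positional encoding
        pe_pts = self.pts_pe(anchors[:, :2])                    # point-cloud/BEV-side positional encoding
        return pe_img + pe_pts                                  # positional part of the object queries
```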

The most important design is this. As the paper clearly states, BEVFusion transforms both modalities into BEV space and then concatenates them, and TransFusion first generates queries from the LiDAR features, takes the top-k, and then looks at the image features; in CMT, the object queries interact with the multi-modal features directly and simultaneously, using position encoding to align the two modalities. And it achieves good results.


3. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers [code]

  • Comments: TransFusion is of course also good work, and besides detection it also does tracking; at the time it ranked first on the nuScenes tracking leaderboard. A major highlight is the soft-association mechanism, which can handle poor image quality and sensor misalignment. It lets the model adaptively decide where and what information to take from the image, giving a robust and effective fusion strategy. Simply put, cross-attention is used to build a soft association between LiDAR and images. Translated from the paper:

The spatially modulated cross attention (SMCA) module weights the cross attention with a 2D circular Gaussian mask around the projected 2D center of each query. The weight mask M is generated in a way similar to CenterNet [66]:

M_ij = exp(-((i - c_x)^2 + (j - c_y)^2) / (σ r^2))

where (i, j) is the spatial index of the weight mask M, (c_x, c_y) is the 2D center computed by projecting the query prediction onto the image plane, r is the radius of the minimum circumscribed circle of the projected corners of the 3D bounding box, and σ is a hyperparameter that adjusts the bandwidth of the Gaussian. This weight map is then multiplied element-wise with the cross-attention maps of all attention heads. In this way, each object query only attends to the relevant region around the projected 2D box, so the network learns better and faster where to pick image features based on the input LiDAR features.
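A sketch of that Gaussian weighting (my own implementation of the formula above, not the TransFusion repo code); the returned mask would be multiplied element-wise into the cross-attention maps of all heads:

```python
import torch

def smca_mask(centers_2d, radii, H, W, sigma=2.0):
    # centers_2d: (N, 2) projected 2D centers (c_x, c_y) of the queries on the feature map
    # radii:      (N,)   radius of the minimum circumscribed circle of the projected box corners
    ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
    cx = centers_2d[:, 0].view(-1, 1, 1)
    cy = centers_2d[:, 1].view(-1, 1, 1)
    r = radii.view(-1, 1, 1)
    # M_ij = exp(-((i - c_x)^2 + (j - c_y)^2) / (sigma * r^2))
    return torch.exp(-((xs[None] - cx) ** 2 + (ys[None] - cy) ** 2) / (sigma * r ** 2))  # (N, H, W)
```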

The problems, of course: if the image quality is extremely poor, for example under bad lighting, performance may suffer; and the Transformer decoder increases computational complexity and memory requirements. (Then again, running a fusion BEV model in real time is hard in any case, to say nothing of occupancy prediction on top of it.)

As the comparison tables above also show, there are still many BEV-related papers. If you are interested, you can pick one related article and then trace a whole family of similar work from its comparison tables. This blog is only a rough introduction to some papers I have read that are relevant to me; it is also just a beginning, and I will keep updating it when I have time.

Something interesting

Since I have written this much, let me share something I find interesting and rather counter-intuitive. Looking only at the three fusion BEV algorithms, CMT clearly has the best performance and speed, followed by BEVFusion, and finally TransFusion. But if you look at their GitHub repositories, the best one, CMT, actually gets the least attention: TransFusion has the most forks and BEVFusion has the most stars. The numbers below are as of December 27, 2023.

  • TransFusion: 1.4k forks, 539 stars

  • CMT: 30 forks, 259 stars

  • BEVFusion: 322 forks, 1.8k stars

As for why this happens, everyone can have their own opinion; as the saying goes, three cobblers with their wits combined can match Zhuge Liang. It reminds me of when I first studied social psychology, which was very interesting. Having written this much, it finally adds up to tens of thousands of characters, so it can count as a long ten-thousand-word article, hahaha. Everyone is welcome to share anything they want to discuss! I am actually thinking about moving into NLP and then embodied AI. I am still hesitating, but I think it can be done; there is nothing that cannot be done!
