Beyond BEV-LaneDet: An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection

Today, the Heart of Autonomous Driving shares the latest progress in lane line detection, surpassing BEV-LaneDet! The method obtains 2D and 3D lane predictions by applying lane features to image-view and BEV features, respectively. If you have related work to share, please contact us at the end of the article!

Editor | Heart of Autonomous Driving


Accurate detection of lane lines in 3D space is crucial for autonomous driving. Existing methods usually first convert image-view features into the bird's eye view (BEV) with the help of inverse perspective mapping (IPM), and then detect lane lines based on the BEV features. However, IPM ignores changes in road height, resulting in inaccurate view transformations. Furthermore, the two independent stages of the pipeline can lead to cumulative errors and increased complexity. To address these limitations, the paper proposes an efficient transformer for 3D lane detection. Unlike the vanilla transformer, the model incorporates a decomposed (factorized) cross-attention mechanism for simultaneously learning lane and BEV representations. This mechanism decomposes the cross-attention between image-view and BEV features into the cross-attention between image-view and lane features and the cross-attention between lane and BEV features, both of which are supervised by ground-truth lane lines.

Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively. This allows for a more accurate view transformation than IPM-based methods, since the view transformation is learned from data with supervised cross-attention. Furthermore, the cross-attention between lane and BEV features enables them to adjust each other, detecting lanes more accurately than two separate stages. Finally, the decomposed cross-attention is more effective than the original cross-attention, and experimental results on OpenLane and ONCE-3DLanes demonstrate the state-of-the-art performance of the method.

Limitations of current mainstream methods

Lane detection is a key component of assisted and autonomous driving systems, as it enables a range of downstream tasks such as route planning, lane keeping assistance, and high-definition (HD) map building. In recent years, deep learning-based lane detection algorithms have achieved impressive results in 2D image space. However, in practical applications, lane lines usually need to be represented in 3D space or in the bird's eye view (BEV), which is especially useful for tasks that involve interacting with the environment, such as planning and control.

A typical 3D lane detection pipeline first detects lane lines in the image view and then projects them into the BEV. The projection is usually achieved by inverse perspective mapping (IPM) under a flat-road assumption. However, as shown in Figure 1, since changes in road height are ignored, IPM causes the projected lane lines to diverge or converge on uphill or downhill roads. To solve this problem, SALAD predicts the depth of the lane line together with its image-view position, and then projects it into 3D space using the camera intrinsics and extrinsics; however, the depth estimation is inaccurate at long range, which degrades the projection accuracy.

[Figure 1: effect of the flat-road assumption of IPM on uneven roads]
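As a concrete illustration of the flat-road assumption, the following is a minimal sketch (with assumed camera intrinsics, camera height and coordinate conventions, not taken from the paper) of how IPM back-projects a lane pixel onto an assumed flat road plane; on a sloped road the true lane point lies above or below this plane, so the intersection lands too far or too near.

```python
import numpy as np

# Minimal IPM sketch: back-project a pixel ray and intersect it with a flat road
# plane (z = 0). All camera parameters below are assumed placeholders.
K = np.array([[1000.0,    0.0, 640.0],   # assumed intrinsics
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
cam_height = 1.5                          # assumed camera height above the road (m)

# Camera axes (x right, y down, z forward) expressed in road coordinates
# (x right, y forward, z up); the camera looks straight down the road for simplicity.
R_c2r = np.array([[1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0],
                  [0.0, -1.0, 0.0]])
cam_origin = np.array([0.0, 0.0, cam_height])

def ipm_project(u, v):
    """Intersect the viewing ray of pixel (u, v) with the assumed flat road z = 0."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_road = R_c2r @ ray_cam
    s = -cam_origin[2] / ray_road[2]      # ray parameter where the ray hits z = 0
    return cam_origin + s * ray_road

print(ipm_project(700.0, 500.0))   # a pixel below the horizon maps to a flat-road point
```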

State-of-the-art methods tend to predict the 3D structure of lane lines directly from the BEV: they first convert image-view feature maps to the BEV with the help of IPM, and then detect lane lines based on the BEV feature maps. However, as shown in Fig. 1(c), due to the planar assumption of IPM, the ground-truth 3D lane lines (blue lines) are not aligned with the BEV lane features (red lines) when encountering uneven roads. To address this issue, some methods represent the ground-truth 3D lane lines in a virtual top-down view by first projecting them onto the image plane and then projecting the result onto the flat road plane via IPM (red lines in Fig. 1(c)). These methods then predict the real height of the lane line and its position in the virtual top-down view, and finally recover its 3D position through a geometric transformation. However, the accuracy of the predicted height significantly affects the transformed BEV position, which hurts the robustness of the model; moreover, the separation of view transformation and lane detection leads to cumulative errors and increased complexity.

To overcome these limitations, the paper proposes an efficient transformer for 3D lane detection. The model incorporates a decomposed cross-attention mechanism to simultaneously learn lane and BEV representations in a supervised manner. This mechanism decomposes the cross-attention between image-view and BEV features into the cross-attention between image-view and lane features and the cross-attention between BEV and lane features. The decomposed cross-attention is supervised with ground-truth lane lines, where 2D and 3D lane predictions are obtained by applying the lane features to the image-view and BEV features, respectively.

To achieve this, dynamic kernels are generated for each lane line according to the lane features; the image-view and BEV feature maps are then convolved with these kernels to obtain image-view and BEV offset maps, respectively. The offset maps predict the offset of each pixel to its nearest lane point in 2D and 3D space, and are processed with a voting algorithm to obtain the final 2D and 3D lane points. Since the view transformation is learned from data with supervised cross-attention, it is more accurate than IPM-based view transformation. Furthermore, the lane and BEV features can be dynamically adjusted to each other through cross-attention, leading to more accurate lane detection than two separate stages. The decomposed cross-attention is also more effective than vanilla cross-attention between image-view and BEV features. Experiments on two benchmark datasets, OpenLane and ONCE-3DLanes, demonstrate the effectiveness and efficiency of the method.

3D lane detection in the image view

3D lane detection can be performed from the image view. Some methods first detect 2D lane lines in the image view and then project them into the bird's eye view. Various methods have been proposed to solve the 2D lane detection problem, including anchor-based, parameter-based, and segmentation-based methods. As for the lane projection, some methods use inverse perspective mapping (IPM), which, owing to its planar assumption, causes the projected lane lines to diverge or converge on uneven roads.

To solve this problem, SALAD predicts the depth of the lane line together with its image-view position, and then projects it into the BEV with the camera intrinsics and extrinsics. However, the depth estimation is inaccurate at long range, which degrades the projection accuracy. Other methods directly predict the 3D structure of lane lines from the image view. For example, CurveFormer applies a transformer to predict the 3D curve parameters of lane lines directly from image-view features, and Anchor3DLane projects lane anchors defined in 3D space onto image-view feature maps and extracts their features for classification and regression. However, these methods are limited by the low resolution of distant image-view features.

3D lane detection in the BEV

Another way to perform 3D lane detection is to first convert the image-view feature map to the BEV and then detect lane lines based on the BEV feature map, where the view transformation is usually based on IPM. For example, some methods adopt a spatial transformer network (STN) for the view transformation, where the sampling grid of the STN is generated with IPM. PersFormer uses a deformable transformer for the view transformation, where the reference points of the transformer decoder are generated by IPM.
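The following is a minimal sketch of this kind of IPM-based view transformation with a sampling grid, in the spirit of the STN-based methods mentioned above; the homography H, the BEV grid resolution and the coordinate ranges are assumed placeholders, not values from any specific method.

```python
import torch
import torch.nn.functional as F

def ipm_view_transform(img_feat, H, bev_h=50, bev_w=32,
                       x_range=(-10.0, 10.0), y_range=(3.0, 103.0)):
    """Warp an image-view feature map (B, C, Ha, Wa) onto a BEV grid via grid_sample."""
    B, C, Ha, Wa = img_feat.shape
    xs = torch.linspace(*x_range, bev_w)
    ys = torch.linspace(*y_range, bev_h)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")            # BEV ground coordinates
    ground = torch.stack([gx, gy, torch.ones_like(gx)], dim=-1)

    pix = ground @ H.T                                        # homography to image plane
    pix = pix[..., :2] / pix[..., 2:3].clamp(min=1e-6)        # (u, v) pixel coordinates

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    u = pix[..., 0] / (Wa - 1) * 2 - 1
    v = pix[..., 1] / (Ha - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(img_feat, grid, align_corners=True)  # (B, C, bev_h, bev_w)

# Usage with random placeholders (H would normally come from the flat-road IPM):
feat = torch.randn(1, 64, 45, 60)
bev_feat = ipm_view_transform(feat, torch.eye(3))             # (1, 64, 50, 32)
```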

However, due to the planar assumption of IPM, the ground-truth 3D lane lines are not aligned with the underlying BEV lane features when encountering uneven roads. To address this issue, some methods represent the ground-truth 3D lane lines in a virtual top-down view by first projecting them onto the image plane and then using IPM to project the result onto the flat ground. The real height of the lane line and its position in the virtual top-down view are predicted, and the lane line is then recovered in 3D space by a geometric transformation. However, the accuracy of the predicted height significantly affects the transformed BEV position and thus the robustness of the model. BEV-LaneDet applies a multi-layer perceptron (MLP) to achieve a better view transformation, but at the cost of a very large number of parameters.

Attention in transformers

The attention mechanism in the transformer requires pairwise similarity computation between queries and keys, which becomes expensive when the numbers of queries and keys are large. To address this issue, some methods attend to only a subset of keys for each query instead of the entire set when computing the attention matrix. CCNet proposes an attention module that gathers context information only from the pixels along each pixel's criss-cross path, and Deformable DETR proposes an attention module that attends to only a small number of key points sampled around a learned reference point. Swin Transformer proposes a shifted-window module that restricts self-attention to non-overlapping local windows while still allowing cross-window connections. Other methods apply low-rank approximations to speed up the computation of the attention matrix; for example, Nyströmformer uses the Nyström method with randomly sampled features to reconstruct the original attention matrix, which reduces the amount of computation. In contrast, our method decomposes the original attention matrix into two low-rank parts according to the lane queries, and each part can be supervised with ground truth, which is better suited to the 3D lane detection task. Moreover, existing transformer approximations usually sacrifice some accuracy, while our method achieves better performance than the original transformer.
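For intuition about why the decomposition is cheaper, here is a rough back-of-the-envelope count of the similarity-matrix computation; the image-view feature resolution, number of lane queries and channel width are assumptions (only the 50×32 BEV grid is quoted later in the implementation details).

```python
# Rough multiply-add counts for building the attention similarity matrices.
Ha, Wa = 46, 60           # assumed image-view feature resolution
Hb, Wb = 50, 32           # BEV query resolution (from the implementation details)
L, C = 20, 256            # assumed number of lane queries and channels

vanilla = (Hb * Wb) * (Ha * Wa) * C                 # BEV queries attend to all image tokens
factorized = L * (Ha * Wa) * C + L * (Hb * Wb) * C  # lanes<->image plus lanes<->BEV
print(vanilla / factorized)   # the decomposition is cheaper by roughly this factor (~50x here)
```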

Method introduction

This paper proposes an efficient transformer for end-to-end 3D lane detection. This section first introduces the overall framework and then describes each component in detail, including the efficient transformer module, the lane detection head, and the bipartite matching loss. The overall framework is shown in Figure 2. It starts with a CNN backbone that extracts image-view feature maps from the input image. Then, an efficient transformer module learns lane and bird's eye view (BEV) features from the image-view features using a decomposed cross-attention mechanism; positional embeddings generated by respective position encoders are added to the image-view and BEV features. Next, the lane detection head uses the lane features to generate a set of dynamic kernels and object scores for each lane line. These kernels are used to convolve the image-view and BEV feature maps, generating image-view and BEV offset maps, respectively. The two sets of offset maps are processed with a voting algorithm to obtain the final 2D and 3D lane points. To train the model, a bipartite matching loss between the 2D/3D predictions and the ground truth is computed.

[Figure 2: overall framework]
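Below is a schematic skeleton of this pipeline as described, using toy stand-in modules (a single convolution for the backbone, standard multi-head attention layers, and 1×1 dynamic kernels); the module names, sizes and query count are assumptions, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LaneTransformerSketch(nn.Module):
    def __init__(self, C=256, L=20, Hb=50, Wb=32, num_classes=14):
        super().__init__()
        self.backbone = nn.Conv2d(3, C, 7, stride=8, padding=3)   # stand-in CNN backbone
        self.bev_query = nn.Parameter(torch.randn(Hb * Wb, C))    # learnable BEV queries
        self.lane_query = nn.Parameter(torch.randn(L, C))         # learnable lane queries
        self.attn_lane = nn.MultiheadAttention(C, 8, batch_first=True)  # lanes <- image + BEV
        self.attn_bev = nn.MultiheadAttention(C, 8, batch_first=True)   # BEV <- lanes
        self.kernel_2d = nn.Linear(C, C * 2)     # dynamic kernels for image-view offsets
        self.kernel_3d = nn.Linear(C, C * 3)     # dynamic kernels for BEV offsets
        self.score = nn.Linear(C, 2 + num_classes)

    def forward(self, img):
        B = img.shape[0]
        feat = self.backbone(img)                                  # (B, C, Ha, Wa)
        C = feat.shape[1]
        I = feat.flatten(2).transpose(1, 2)                        # image tokens (B, HaWa, C)
        bev_q = self.bev_query.unsqueeze(0).expand(B, -1, -1)
        lane_q = self.lane_query.unsqueeze(0).expand(B, -1, -1)

        # Decomposed cross-attention: lanes attend to image tokens and BEV queries,
        # then BEV features are rebuilt from the lane features.
        kv = torch.cat([I, bev_q], dim=1)
        O, _ = self.attn_lane(lane_q, kv, kv)                      # lane features (B, L, C)
        V, _ = self.attn_bev(bev_q, O, O)                          # BEV features (B, HbWb, C)

        K2d = self.kernel_2d(O).view(B, -1, C, 2)                  # per-lane dynamic kernels
        K3d = self.kernel_3d(O).view(B, -1, C, 3)
        off_2d = torch.einsum("bnc,blcd->blnd", I, K2d)            # image-view offset maps
        off_3d = torch.einsum("bnc,blcd->blnd", V, K3d)            # BEV offset maps
        return off_2d, off_3d, self.score(O)

out = LaneTransformerSketch()(torch.randn(1, 3, 368, 480))
```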

Efficient Transformer Module

As shown in Fig. 2, given an input image X ∈ R^{H_0×W_0×3}, a CNN backbone first extracts the image-view feature map F ∈ R^{H_a×W_a×C}, where H_a, W_a and C are the height, width and number of channels of F, respectively. The positional embedding E ∈ R^{H_a×W_a×C} generated by a position encoder (as described in Section 3.3) is added to F, which is then flattened to a sequence I ∈ R^{H_aW_a×C}. A BEV query map T ∈ R^{H_b×W_b×C} with learnable parameters is initialized; another positional embedding P ∈ R^{H_b×W_b×C} generated by a second position encoder is added to it, and it is then flattened to a sequence B ∈ R^{H_bW_b×C}.

After obtaining the image-view features and BEV queries, a set of lane queries Q ∈ R^{L×C} with learnable parameters is initialized, representing L different lane-line prototypes. The lane features O ∈ R^{L×C} are then learned from the image-view features I and the BEV queries B through cross-attention. Let O_i ∈ R^C denote the i-th lane feature corresponding to the i-th lane query Q_i; O_i is obtained by:

[Equations defining the lane feature O_i via cross-attention, with mappings g_o(·) and f_o(·,·)]

Then, the BEV features V are constructed from the lane features O through cross-attention as follows:

[Equation defining the BEV features V, with mappings g_v(·) and f_v(·,·)]

where g_v(·) and f_v(·,·) have the same form as g_o(·) and f_o(·,·) above, respectively, except for their learnable weight matrices. In this way, the original cross-attention between the BEV features V and the image-view features I shown in Fig. 3 is decomposed into the cross-attention between the image-view features I and the lane features O, and the cross-attention between the lane features O and the BEV features V.

[Figure 3: original vs. decomposed cross-attention]
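The following toy snippet (single head, no learned projections, assumed sizes) illustrates the decomposition in Fig. 3: the large BEV-to-image attention map is replaced by two small maps routed through the L lane queries, so the effective mixing matrix has rank at most L.

```python
import torch

HaWa, HbWb, L, C = 2760, 1600, 20, 256
I = torch.randn(HaWa, C)      # image-view tokens
B = torch.randn(HbWb, C)      # BEV queries
Q = torch.randn(L, C)         # lane queries

A_lane = torch.softmax(Q @ I.T / C**0.5, dim=-1)     # lanes attend to image tokens
O = A_lane @ I                                       # lane features
A_bev = torch.softmax(B @ O.T / C**0.5, dim=-1)      # BEV attends to lane features
V = A_bev @ O                                        # BEV features

# The effective BEV <- image mixing matrix has rank at most L:
effective = A_bev @ A_lane                           # (HbWb, HaWa)
print(effective.shape, torch.linalg.matrix_rank(effective).item() <= L)
```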

Compared with the original cross-attention, the decomposed cross-attention provides three benefits. First, it achieves a better view transformation by supervising the decomposed cross-attention with 2D and 3D ground-truth lane lines; the 2D and 3D lane predictions are obtained by applying the lane features O to the image-view features I and the BEV features V, respectively. Second, the dynamic adjustment between the lane features O and the BEV features V is realized by cross-attention, which improves the accuracy of 3D lane detection. Third, it significantly reduces the amount of computation and improves real-time efficiency.

Similarly, the image-view features I are updated with the lane features O through cross-attention as follows, yielding the updated image-view features M:

[Equation updating the image-view features with the lane features]

Compared with the original self-attention, the decomposed self-attention achieves a dynamic adjustment between the lane features O and the image-view features I through cross-attention, thus improving the accuracy of 2D lane detection. Furthermore, since both the image-view features M and the BEV features V are constructed from the lane features O, they can be better aligned with each other.

Position Embedding

For the image-view positional embedding E, a 3D coordinate grid G is first constructed in camera space, where D is the number of discrete depth bins. Each point in G can be expressed as p_j = (u_j×d_j, v_j×d_j, d_j, 1), where (u_j, v_j) is the corresponding pixel coordinate in the image-view feature map F and d_j is the corresponding depth value. The grid G is then transformed into a grid G′ in 3D space as follows:

[Equation transforming the frustum grid G into the 3D grid G′]
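A sketch of how such a frustum grid and image-view positional embedding could be built is given below; the depth-bin layout, the placeholder projection matrix and the small MLP encoder are assumptions based on the description above, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def image_position_embedding(Ha=46, Wa=60, D=64, C=256,
                             d_min=1.0, d_max=100.0, P_img2lidar=None):
    """Build frustum points (u*d, v*d, d, 1), lift them to 3D, and encode them per pixel."""
    if P_img2lidar is None:
        P_img2lidar = torch.eye(4)                                 # placeholder projection
    us = torch.arange(Wa, dtype=torch.float32)
    vs = torch.arange(Ha, dtype=torch.float32)
    ds = torch.linspace(d_min, d_max, D)
    v, u, d = torch.meshgrid(vs, us, ds, indexing="ij")            # (Ha, Wa, D)
    G = torch.stack([u * d, v * d, d, torch.ones_like(d)], -1)     # frustum grid G
    G3d = G @ P_img2lidar.T                                        # lifted grid G'
    coords = G3d[..., :3].reshape(Ha, Wa, D * 3)                   # per-pixel 3D coordinates
    encoder = nn.Sequential(nn.Linear(D * 3, C), nn.ReLU(), nn.Linear(C, C))
    return encoder(coords)                                         # E: (Ha, Wa, C)

E = image_position_embedding()
print(E.shape)   # torch.Size([46, 60, 256])
```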

A height distribution Z ∈ R^{H_b×W_b×Z} is predicted from T, indicating the probability that each BEV pixel belongs to each height bin; the positional embedding P is then obtained as follows:

[Equation defining the BEV positional embedding P from the height distribution]

Lane detection head

First, two multi-layer perceptrons (MLPs) are applied to the lane features O to generate two sets of dynamic kernels, K_a ∈ R^{L×C×2} and K_b ∈ R^{L×C×3}. K_a and K_b are then used to convolve the image-view features M and the BEV features V, yielding the image-view offset maps R_a ∈ R^{L×H_a×W_a×2} and the BEV offset maps R_b ∈ R^{L×H_b×W_b×3}. R_a predicts the horizontal and vertical offsets of each pixel in the image view to its nearest lane point; R_b predicts the offsets of each BEV pixel in the x and y directions to its nearest lane point, as well as the real height of that lane point.
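One way to apply the per-lane dynamic kernels is as 1×1 convolutions whose weights come from the lane features; a minimal sketch with assumed sizes and tensor layouts (the BEV branch is identical with C×3 kernels):

```python
import torch
import torch.nn.functional as F

L, C, Ha, Wa = 20, 256, 46, 60
M = torch.randn(1, C, Ha, Wa)            # image-view feature map
K_a = torch.randn(L, C, 2)               # dynamic kernels generated from lane features

weight = K_a.permute(0, 2, 1).reshape(L * 2, C, 1, 1)   # (L*2, C, 1, 1) conv weights
R_a = F.conv2d(M, weight).view(1, L, 2, Ha, Wa)         # per-lane (du, dv) offset maps
print(R_a.shape)
```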

Then, another MLP is applied to the lane features O to generate object scores S ∈ R^{L×(2+N)}, containing the background and foreground probabilities and the probabilities of the N lane classes. The image-view offset maps R_a and the BEV offset maps R_b are then processed with a voting algorithm to obtain the 2D and 3D lane points, respectively. The procedure for R_b is shown in Algorithm 1 (the procedure for R_a is similar, except that z and r are removed): the voting algorithm lets every pixel vote for its predicted lane point and then selects the points whose vote counts exceed the lane-width threshold w to form the predicted lane line. Finally, only the predicted lane lines whose foreground probability exceeds the object threshold t are kept as output.

[Algorithm 1: the voting algorithm for the BEV offset maps]
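Since Algorithm 1 itself is not reproduced here, the following is a rough re-implementation of the voting step as described in the text for a single lane's BEV offset map; the threshold value, grid size and the exact vote-accumulation details are assumptions.

```python
import numpy as np

def vote_lane_points(off_xy, off_z, width_thresh=5):
    """Each BEV pixel votes for the cell its predicted nearest lane point falls in;
    cells with enough votes become lane points, keeping the mean predicted height z."""
    Hb, Wb, _ = off_xy.shape
    votes = np.zeros((Hb, Wb), dtype=np.int32)
    z_sum = np.zeros((Hb, Wb), dtype=np.float32)
    for i in range(Hb):
        for j in range(Wb):
            ti = int(round(i + off_xy[i, j, 0]))      # voted row (y direction)
            tj = int(round(j + off_xy[i, j, 1]))      # voted column (x direction)
            if 0 <= ti < Hb and 0 <= tj < Wb:
                votes[ti, tj] += 1
                z_sum[ti, tj] += off_z[i, j]
    ys, xs = np.nonzero(votes > width_thresh)
    zs = z_sum[ys, xs] / votes[ys, xs]
    return np.stack([xs, ys, zs], axis=1)             # (num_points, 3) lane points

points = vote_lane_points(np.random.randn(50, 32, 2), np.random.randn(50, 32))
```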

Bipartite Matching Loss

[Equations for the bipartite matching cost and the training loss]
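The exact cost terms of the equations above are not reproduced here; the snippet below sketches a generic DETR-style bipartite matching step with an assumed classification-plus-point-distance cost, using the Hungarian algorithm from SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_fg_prob, pred_points, gt_points, w_cls=1.0, w_reg=1.0):
    """Assign each ground-truth lane to one prediction by minimizing an assumed cost."""
    L, G = len(pred_points), len(gt_points)
    cost = np.zeros((L, G))
    for i in range(L):
        for j in range(G):
            reg = np.abs(pred_points[i] - gt_points[j]).mean()   # point-wise L1 distance
            cost[i, j] = -w_cls * pred_fg_prob[i] + w_reg * reg
    rows, cols = linear_sum_assignment(cost)          # Hungarian assignment
    return list(zip(rows.tolist(), cols.tolist()))    # matched (prediction, GT) pairs

pairs = match(np.random.rand(20), np.random.randn(20, 10, 3), np.random.randn(4, 10, 3))
```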

Experimental results

Experiments are conducted on two 3D lane detection benchmarks, OpenLane and ONCE-3DLanes. OpenLane contains 160K and 40K images for the training and validation sets, respectively. The validation set includes six different scenarios: curve, intersection, night, extreme weather, merge and split, and up and down. It annotates 14 lane categories, including road edge, double yellow solid lane, etc. ONCE-3DLanes contains 200K, 3K and 8K images for training, validation and testing, respectively, covering morning, noon, afternoon and night, sunny, cloudy and rainy weather, and scenes including downtowns, suburbs, roads, bridges, and tunnels.

For OpenLane, the F-score is used for evaluation, together with the classification accuracy of matched lanes; predictions are matched to the ground truth with an edit distance, and a predicted lane is counted as a true positive only if 75% of its y-positions have a point-wise distance below the maximum allowed distance of 1.5 meters. For ONCE-3DLanes, a two-stage evaluation metric is employed to measure the similarity between predicted and ground-truth lanes: lanes are first matched in the top view with the traditional IoU, and if the IoU is greater than the threshold (0.3), the curve-matching error is computed with the unilateral chamfer distance; the predicted lane is considered a true positive if the chamfer distance is less than the threshold (0.3 meters).
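A minimal sketch of the unilateral (one-sided) chamfer distance used in this matching step; the point sampling and units are assumed, with the thresholds quoted above.

```python
import numpy as np

def one_sided_chamfer(pred_pts, gt_pts):
    """Average distance from each predicted lane point to its nearest ground-truth point."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean()

cd = one_sided_chamfer(np.random.randn(30, 3), np.random.randn(40, 3))
print(cd)   # a lane counts as a true positive if top-view IoU > 0.3 and this value < 0.3 m
```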

ResNet-18, ResNet-34 and EfficientNet-B7 pre-trained on ImageNet are used as the CNN backbones. The input images are augmented with random horizontal flips and random rotations and resized to 368×480. The spatial resolution of the BEV feature map is 50×32, representing the BEV space with ranges of [-10, 10] and [3, 103] meters along the x and y directions, respectively. The BEV offset maps are resized to 400×256 for the final prediction. The model is optimized with AdamW, with betas of 0.9 and 0.999 and a weight decay of 1e-4. The batch size is set to 16, and the model is trained for 50 epochs.
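For reference, a minimal optimizer setup matching these hyperparameters might look like the following; the learning rate is not stated in the text and is an assumed placeholder, as is the stand-in model.

```python
import torch

model = torch.nn.Linear(10, 10)    # stand-in for the lane detection model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,                       # assumed; not stated in the article
    betas=(0.9, 0.999),
    weight_decay=1e-4,
)
# Per the implementation details: batch size 16, 50 epochs, input 368x480,
# BEV grid 50x32 covering x in [-10, 10] m and y in [3, 103] m.
```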


Performance on the OpenLane dataset: the comparison results on OpenLane are shown in Table 1. Using ResNet-18 as the backbone, our method achieves an F-score of 60.7, which is 10.2, 6.4 and 2.9 higher than PersFormer, Anchor3DLane and BEV-LaneDet, respectively. The lowest prediction errors are also obtained in the x, y and z directions. As shown in Table 2, our method achieves the best performance in all six scenarios, demonstrating its robustness; for example, in the "up and down", "curve", "intersection" and "merge and split" scenarios, using ResNet-34 as the backbone, the F-score is 8.2, 7.9, 5.6 and 9.4 higher than BEV-LaneDet, respectively. Figure 5 shows qualitative comparison results on OpenLane, including uphill, downhill, curve and bifurcation scenarios; the results show that lane lines on uneven roads and lane lines with complex topologies are handled well.


Performance on the ONCE-3DLanes dataset: the comparison results on ONCE-3DLanes are shown in Table 3. With ResNet-18 as the backbone, our method obtains an F-score of 79.67, which is 15.60, 5.34 and 4.80 higher than SALAD, PersFormer and Anchor3DLane, respectively. The lowest CD error is also achieved, demonstrating the good accuracy of the proposed method.


Impact of the attention decomposition: the paper compares the proposed decomposed attention with IPM-based attention and the original attention. IPM-based attention is used in PersFormer for view transformation, where IPM is used to compute the reference points of the transformer. As shown in Table 4, the original attention performs slightly better than the IPM-based attention, and the decomposed attention achieves F-scores 3.3 and 2.3 higher than the original attention on OpenLane and ONCE-3DLanes, respectively. This is because the decomposed attention allows a more accurate view transformation by using 2D and 3D ground truth to supervise the cross-attention between image-view and lane features and the cross-attention between lane and BEV features; moreover, it improves the accuracy of 3D lane detection through the dynamic adjustment between lane and BEV features.


Reference

[1] An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection.

Origin blog.csdn.net/CV_Autobot/article/details/131336169