LATR: 3D Lane Detection from Monocular Images with Transformer

Reference: LATR

Motivation and main work:
Previous 3D lane detection algorithms rely on operations such as IPM projection and 3D anchors with NMS post-processing, and these operations all bring some negative effects. IPM projection places demands on depth estimation and on the accuracy of the camera intrinsics and extrinsics, while anchor-based methods require post-processing such as NMS. The main contributions of this article are twofold:

  • 1) Based on the characteristics of lane lines, a lane-query-based detection method is proposed on top of the DETR detection framework. To make the query initialization more reasonable, the SparseInst approach is used to initialize queries from different instances in the 2D image domain. The granularity of the query is not the whole lane line but the individual points on the lane line.
  • 2) It is difficult to learn 3D information when image features are used directly as key and value. A learnable 3D spatial positional encoding is therefore constructed from the known camera intrinsics and extrinsics, and across multiple decoder iterations it is fused with the image features; a residual scheme continuously corrects this 3D spatial positional encoding.

The structure of the detector:
The method flow of this article is shown in the figure below:
[Figure: overall LATR pipeline]
As the figure shows, a lane-instance prediction network is attached after the backbone to generate and initialize the lane queries. The image features are paired with a positional encoding built from embedded 3D information, and this encoding is refined starting from a given initialization, which means its value is dynamic throughout the transformer decoding process. A rough skeleton of this flow is sketched below.
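The following PyTorch skeleton is a heavily simplified, hypothetical rendering of that flow (the backbone, instance head, and decoder here are placeholder modules, not the released LATR implementation):

```python
import torch
import torch.nn as nn

class LATRSketch(nn.Module):
    """Hypothetical, simplified flow: backbone -> lane-instance queries ->
    transformer decoder over image features + 3D positional encoding."""
    def __init__(self, C=256, N=40, M=20):
        super().__init__()
        self.backbone = nn.Conv2d(3, C, 3, stride=4, padding=1)  # stand-in backbone
        self.inst_head = nn.Conv2d(C, N, 1)                      # stand-in instance activation maps
        self.point_query = nn.Parameter(torch.randn(M, C))       # learnable point-level queries
        layer = nn.TransformerDecoderLayer(d_model=C, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, img, pos_3d):
        feat = self.backbone(img)                                # (B, C, H, W)
        # SparseInst-style: instance maps aggregate features into N lane queries.
        maps = self.inst_head(feat).flatten(2).softmax(-1)       # (B, N, H*W)
        q_lane = torch.bmm(maps, feat.flatten(2).transpose(1, 2))                  # (B, N, C)
        q = (q_lane[:, :, None] + self.point_query[None, None]).flatten(1, 2)      # (B, N*M, C)
        # Keys/values: image features plus the (dynamically updated) 3D positional encoding.
        kv = (feat + pos_3d).flatten(2).transpose(1, 2)          # (B, H*W, C)
        return self.decoder(q, kv)                               # (B, N*M, C)

model = LATRSketch()
img = torch.randn(1, 3, 64, 128)
pos_3d = torch.randn(1, 256, 16, 32)  # placeholder for the 3D positional encoding
print(model(img, pos_3d).shape)       # torch.Size([1, 800, 256])
```

In the actual method the positional encoding is not fixed as it is here; it is re-projected and refined layer by layer during decoding, as described in the following sections.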

Lane line query construction:
This part follows the instance-feature construction in SparseInst (see that paper for details), which yields the lane-query features $Q_{lane}\in \mathbb{R}^{N\times C}$ (obtained from the feature map with the largest spatial size). Since a lane line consists of multiple points, a query for those points is also needed; this is realized with a set of learnable parameters $Q\in \mathbb{R}^{M\times C}$. A broadcast then combines the two into the final lane query $Q\in \mathbb{R}^{(N\cdot M)\times C}$, as sketched below.
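A minimal PyTorch sketch of this broadcast construction (shapes follow the notation above; `Q_lane` is a random stand-in for the real SparseInst-style instance features, and combining the two by a broadcast sum is one natural reading of the description):

```python
import torch

N, M, C = 40, 20, 256  # lane instances, points per lane, channels

# Instance-level lane queries, in the paper pooled from the largest feature map
# via a SparseInst-style instance head (random stand-in here).
Q_lane = torch.randn(N, C)                        # (N, C)

# Learnable point-level queries shared across all lane instances.
Q_point = torch.nn.Parameter(torch.randn(M, C))   # (M, C)

# Broadcast-combine instance and point queries, then flatten to (N*M, C),
# so each query corresponds to one point of one lane instance.
Q = (Q_lane[:, None, :] + Q_point[None, :, :]).reshape(N * M, C)
print(Q.shape)  # torch.Size([800, 256])
```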

The instance + point-query formulation turns out to be the best:
[Table: ablation on the query formulation]

Position encoding of image features:
The focus here is on lane lines in the autonomous-driving scene. Based on the distribution characteristics of lane lines, a positional encoding can be constructed for the corresponding 2D image features: points are first sampled in 3D space (the 3D ground plane defined in the article) and then projected into the image through the camera intrinsics and extrinsics, serving as the 3D position source at the corresponding image locations. The difference is that this 3D ground plane is dynamically updated, with different update residuals predicted at different transformer layers. The residual variables are the rotation angle $\Delta\theta$ and the plane height $\Delta h$, predicted by a set of FC layers:
$$[\Delta\theta, \Delta h] = \mathrm{MLP}(\mathrm{AvgPool}(\mathcal{G}[X, M_p]))$$
where $\mathcal{G}$, $X$, and $M_p$ denote a two-layer convolution operation, the image features, and the ground-plane positional encoding from the previous round, respectively.
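A minimal sketch of this residual head following the formula above (the channel sizes and the exact form of the two-layer convolution $\mathcal{G}$ are assumptions; $X$ and $M_p$ are random stand-ins):

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 256, 16, 32
X = torch.randn(B, C, H, W)    # image features
M_p = torch.randn(B, C, H, W)  # ground-plane positional embedding from the previous round

# G: assumed 2-layer convolution over the concatenated [X, M_p]
G = nn.Sequential(
    nn.Conv2d(2 * C, C, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True),
)
mlp = nn.Linear(C, 2)  # predicts [delta_theta, delta_h]

g = G(torch.cat([X, M_p], dim=1))                 # (B, C, H, W)
pooled = nn.functional.adaptive_avg_pool2d(g, 1)  # AvgPool -> (B, C, 1, 1)
delta_theta, delta_h = mlp(pooled.flatten(1)).unbind(-1)
print(delta_theta.shape, delta_h.shape)           # torch.Size([1]) torch.Size([1])
```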

The plane points for the new round are then updated with the following matrix:
[Figure: plane update matrix]
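The update amounts to a small rigid transform of the sampled plane points: a rotation by $\Delta\theta$ plus a height shift $\Delta h$. A hedged sketch (the rotation axis and coordinate convention here are assumptions and may differ from the released code):

```python
import torch

def update_plane(points, delta_theta, delta_h):
    """Apply the predicted residual rotation and height offset to the sampled
    3D ground-plane points. `points` has shape (P, 3) = (x, y, z); rotating
    about the lateral x-axis is an assumption of this sketch."""
    c, s = torch.cos(delta_theta), torch.sin(delta_theta)
    one, zero = torch.ones(()), torch.zeros(())
    R = torch.stack([
        torch.stack([one, zero, zero]),
        torch.stack([zero, c, -s]),
        torch.stack([zero, s, c]),
    ])
    t = torch.stack([zero, zero, delta_h])  # shift the plane height
    return points @ R.T + t

# Example: a flat z=0 plane grid, tilted and lifted by the predicted residuals.
xs, ys = torch.meshgrid(torch.linspace(-10, 10, 5), torch.linspace(3, 100, 5), indexing="ij")
plane = torch.stack([xs, ys, torch.zeros_like(xs)], dim=-1).reshape(-1, 3)
new_plane = update_plane(plane, torch.tensor(0.02), torch.tensor(-0.1))
print(new_plane.shape)  # torch.Size([25, 3])
```

The updated points are then re-projected through the camera intrinsics and extrinsics to refresh the positional encoding for the next decoder layer.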

In this way, the initially inaccurate 3D ground plane can be optimized through adaptive regression, which in turn refines the 3D positional encoding of the features. In addition, a ground-plane constraint is established using the projections of the lane-line points:
$$L_{plane}=\sum_{(u,v)\in \mathcal{P}\cap\mathcal{L}}\left\|M_p[:,u,v]-M_l[:,u,v]\right\|_2$$
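A sketch of this constraint as a masked L2 distance between $M_p$ and the lane-derived embedding $M_l$ (the lane mask below is a placeholder for the pixels covered by the projected lane points):

```python
import torch

C, H, W = 256, 16, 32
M_p = torch.randn(C, H, W)  # ground-plane positional embedding
M_l = torch.randn(C, H, W)  # embedding built from projected ground-truth lane points
lane_mask = torch.zeros(H, W, dtype=torch.bool)
lane_mask[10:, 14:18] = True  # placeholder for pixels hit by projected lane points

# L_plane: L2 distance between the two embeddings, summed over the masked pixels.
diff = (M_p - M_l)[:, lane_mask]  # (C, num_masked_pixels)
L_plane = diff.norm(dim=0).sum()
print(L_plane)
```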

The final effect is to make the green plane in the figure below approach the red lane lines. (As of 10.09.2023, this part of the code has not been released.) The figure shows the ground plane converging to the actual lane-line positions as the iterations proceed:
[Figure: the ground plane (green) converging to the lane lines (red) over iterations]
To analyze the role of the positional encoding, first look at the performance improvement it brings:
[Table: ablation on positional encoding variants]

From the table above we can see that positional encoding does improve performance, whether it is a view-frustum encoding or a fixed-plane encoding, but the dynamic plane encoding used here is better suited to lane lines and brings roughly one more point of improvement. This indicates that accurate positional encoding helps obtain better detection performance. The plane optimization proposed in the article has only 2 degrees of freedom; would more degrees of freedom do even better?

The impact of lane line query + position encoding on detection performance:
[Table: ablation on lane query and positional encoding]

The subsequent lane-line decoding is consistent with the standard DETR decoder and is not expanded on here.

Performance on different datasets:
OpenLane validation:
[Table: results on the OpenLane validation set]
OpenLane performance under different weather conditions:
[Table: OpenLane results under different weather conditions]

Source: blog.csdn.net/m_buddy/article/details/133692560