DG-BEV: Towards Domain Generalization for Multi-view 3D Object Detection in Bird-Eye-View

Reference code: none

1. Overview

Introduction: When a BEV perception algorithm is deployed, it inevitably encounters camera changes, both in the intrinsic and the extrinsic parameters. The intrinsics determine how large an object appears in the image, while the extrinsics determine where it appears. To address this, the paper proposes a scheme for aligning intrinsics and extrinsics (the work is built on BEVDepth). The object-size problem caused by differing intrinsics is handled by depth compensation, and the shift in imaging position caused by differing extrinsics is handled by a homography mapping. In addition, to make the image features robust to the camera intrinsics, a domain classifier is trained adversarially to improve the generalization of the generated features. Judging from the reported results, however, this last component brings a less pronounced improvement than the first two.

The effect of the camera's intrinsic and extrinsic parameters on how objects are imaged is shown in the figure below:

(Figure: differences in imaged object size and position caused by camera intrinsics and extrinsics)

As the figure shows, the camera intrinsics and extrinsics cause differences in both the size and the position at which an object appears in the image. Only by aligning the intrinsics and extrinsics can a severe performance drop on the target domain be avoided.

2. Method implementation

The method essentially compensates and adapts for the intrinsic and extrinsic parameters, and additionally adds an adversarial (GAN-style) loss to make the image features robust to the intrinsics. These correspond to the three parts in the figure below:

(Figure: overall pipeline with its three components: IDD, DPA, and DIFL)

Intrinsic parameter alignment (IDD):
Among the camera intrinsics, the focal length plays the key role: it largely determines the imaged size of an object, which causes problems for the depth estimation of LSS-based methods. The most direct idea is to compensate the depth scale, i.e., to transform the network's depth estimate as:
$$d=\frac{s}{c}\cdot d_m$$
where $d_m$ is the depth predicted by the network, $c$ is the scale of the reference camera, and the scale $s$ of the current camera is computed as:
$$s=\sqrt{\frac{1}{f_x^2}+\frac{1}{f_y^2}}$$
Besides this method, the intrinsics can also be used directly to handle this situation.
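As a minimal sketch of this depth compensation, assuming per-camera focal lengths `fx`, `fy` and a precomputed reference scale `c` (names and tensor shapes are illustrative, not from the paper):

```python
import torch

def intrinsic_depth_scale(fx: torch.Tensor, fy: torch.Tensor) -> torch.Tensor:
    """Per-camera scale s = sqrt(1/fx^2 + 1/fy^2)."""
    return torch.sqrt(1.0 / fx**2 + 1.0 / fy**2)

def compensate_depth(d_m: torch.Tensor, fx: torch.Tensor, fy: torch.Tensor,
                     c: float) -> torch.Tensor:
    """Rescale the network's depth estimate d_m (N, H, W) by s / c,
    where c is the scale of the chosen reference camera."""
    s = intrinsic_depth_scale(fx, fy)          # (N,)
    return (s / c).view(-1, 1, 1) * d_m        # broadcast over the depth map
```

The division by `c` makes the output equal to the uncompensated estimate whenever the current camera matches the reference camera.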

Extrinsic parameter alignment (DPA):
Because vehicle models and sensor mounting positions differ, the extrinsics relative to the vehicle-body coordinates also vary, which shifts the imaged position of objects. To simulate such changes, a perturbation scheme is used: random noise is added to the original camera rotation angles:
$$\hat{P}_i=(y_i+\Delta y_i,\; p_i+\Delta p_i,\; r_i+\Delta r_i)$$
Several points are then sampled on each object; this paper takes the 4 bottom vertices of the GT box plus its bottom-center point. For these 3D points, let $q$ and $\hat{q}$ denote their projected positions in the image under the old and new extrinsics. If these points still fall within the image, least squares can be used to estimate the homography relating the image under the old and new extrinsics:
$$\hat{q}=Hq$$
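A sketch of this estimation step, assuming 4x4 camera-from-ego extrinsic matrices `T_old`/`T_new` and a 3x3 intrinsic matrix `K` (hypothetical names; depth-positivity and image-boundary checks are omitted):

```python
import numpy as np
import cv2

def fit_extrinsic_homography(pts_3d: np.ndarray, K: np.ndarray,
                             T_old: np.ndarray, T_new: np.ndarray) -> np.ndarray:
    """Project 3D box points (N, 3) under the old and perturbed extrinsics
    and fit H such that q_hat ~ H @ q (homogeneous pixel coordinates)."""
    def project(T):
        pts_h = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])  # (N, 4)
        cam = (T @ pts_h.T)[:3]                                  # (3, N)
        uv = K @ cam
        return (uv[:2] / uv[2]).T.astype(np.float32)             # (N, 2) pixels

    q, q_hat = project(T_old), project(T_new)
    # method=0: plain least squares over all point pairs.
    H, _ = cv2.findHomography(q, q_hat, method=0)
    return H
```

Four or more point correspondences determine $H$; the 5 points per GT box described above already give an over-determined least-squares fit.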

Intrinsic robustness of image features (DIFL):
The plausible focal-length interval $[\alpha,\beta]$ is divided into $K$ bins with thresholds:
$$t_i=\alpha+\frac{(\beta-\alpha)\cdot i}{K}$$
This yields $K+1$ division thresholds and $K+2$ categories. The figure below shows an example with $K=4$:

(Figure: focal-length binning example with K=4)

A classifier $\phi$ with parameters $\theta$ is used to predict the intrinsic-parameter distribution of the image:
$$y=\phi(x,\theta),\quad y\in\mathbb{R}^{2(K+1)}$$
The corresponding GT label is $l\in\{0,1,\dots,K+1\}$, and the loss between prediction and GT takes the form of a discrete cross entropy:
$$L(y,l)=-\sum_{k=0}^{K}\Big[\gamma(k,l)\log(P^k)+(1-\gamma(k,l))\log(1-P^k)\Big]$$
(the sum runs over the $K+1$ binary classifiers $k=0,\dots,K$, matching the $2(K+1)$-dimensional output)
Here the interval indicator function is:
$$\gamma(k,l)=\begin{cases}1 & \text{if } l\le k\\ 0 & \text{if } l>k\end{cases}$$
and the classification probability is:
$$P^k=\frac{e^{y_{2k}}}{e^{y_{2k}}+e^{y_{2k+1}}}$$
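A toy sketch of the binning and the ordinal loss above (shapes and helper names are illustrative, not the paper's code):

```python
import numpy as np
import torch
import torch.nn.functional as F

def focal_bin_label(f: float, alpha: float, beta: float, K: int) -> int:
    """Label in {0, ..., K+1} from thresholds t_i = alpha + (beta - alpha) * i / K."""
    t = alpha + (beta - alpha) * np.arange(K + 1) / K   # K+1 thresholds t_0..t_K
    return int(np.searchsorted(t, f, side="right"))     # 0 below t_0, K+1 at/above t_K

def ordinal_intrinsic_loss(y: torch.Tensor, l: torch.Tensor, K: int) -> torch.Tensor:
    """y: (B, 2(K+1)) logits, l: (B,) integer labels in {0, ..., K+1}."""
    B = y.shape[0]
    pairs = y.view(B, K + 1, 2)              # pair (y_{2k}, y_{2k+1}) per classifier k
    log_p = F.log_softmax(pairs, dim=-1)     # [..., 0] = log P^k, [..., 1] = log(1 - P^k)
    k = torch.arange(K + 1, device=y.device)
    gamma = (l.unsqueeze(1) <= k).float()    # gamma(k, l) = 1 if l <= k, else 0
    loss = -(gamma * log_p[..., 0] + (1.0 - gamma) * log_p[..., 1])
    return loss.sum(dim=1).mean()
```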
The goal here is not to minimize this loss with respect to the image features but to maximize it, so that the generated features become uninformative about, and hence robust to, the camera intrinsics. This is realized with a gradient reversal layer (GRL); a minimal GRL sketch is given after the references below. For background on GRL, see:

  1. What does Gradient Reversal Layer refer to?
  2. Gradient Reversal Layer (GRL)
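For reference, a standard PyTorch GRL in the style of Ganin & Lempitsky's domain-adversarial training (the scale factor `lambd` is the conventional one, not a detail from this paper):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the features.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage: the image features pass through the GRL before the intrinsic
# classifier, so minimizing the classifier loss pushes the backbone to
# maximize it:
#   logits = classifier(grad_reverse(features))
```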

The impact of each of the above components on performance:

(Figure: ablation of the three components)

3. Experimental results

Results for transfer between different datasets:

(Figure: cross-dataset transfer results)

Source: blog.csdn.net/m_buddy/article/details/131426749