FSM: Full Surround Monodepth from Multiple Cameras

Reference code: None

Introduction

Depth estimation is a fundamental environmental-perception task, and 3D perception built on top of it can be more accurate and generalize better. Monocular self-supervised depth estimation already has classic models such as MonoDepth and ManyDepth; this article looks at multi-camera self-supervised depth estimation, which builds on the monocular self-supervised framework by adding mutual constraints between the cameras. Specifically, each camera predicts its own depth map and pose, and the self-supervised loss is constructed from the overlap between camera views, the ego-motion poses, and the depth estimates. The multi-camera setup is mainly used to constrain the poses predicted from the individual views, and it is precisely the use of the camera extrinsics that gives the network the ability to perceive real-world (metric) scale.

Method design

Self-supervised depth estimation using spatio-temporal contexts

In the monocular depth estimation task, the photometric reconstruction error is constructed from images captured at different time steps; its typical form is:
$$L_p(I_t,\hat{I}_t)=\alpha\frac{1-\mathrm{SSIM}(I_t,\hat{I}_t)}{2}+(1-\alpha)\|I_t-\hat{I}_t\|$$
where $\hat{I}_t$ is obtained by warping with the estimated pose and depth. The warp process is written as:
$$\hat{p}^t=\pi\big(\hat{R}^{t\rightarrow c}\phi(p^t,\hat{d}^t,K)+\hat{t}^{t\rightarrow c},K\big)$$
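As an illustration only, here is a minimal PyTorch-style sketch of the photometric term above, assuming image tensors of shape (B, C, H, W); the box-filter SSIM and the weight α = 0.85 are common choices in self-supervised depth pipelines, not values specified in this article:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    # Simplified SSIM over 3x3 neighborhoods, as commonly used in
    # self-supervised depth estimation pipelines.
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)

def photometric_loss(I_t, I_t_hat, alpha=0.85):
    # L_p = alpha * (1 - SSIM) / 2 + (1 - alpha) * |I_t - I_t_hat|
    ssim_term = (1 - ssim(I_t, I_t_hat)) / 2
    l1_term = (I_t - I_t_hat).abs()
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean(1, keepdim=True)
```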
Under a multi-camera system, in addition to using temporal information as in the monocular case, spatial information can also be introduced (since adjacent cameras in the rig have overlapping fields of view), and the spatial and temporal information can be mixed (i.e., the current frame is warped temporally and then spatially). The transformation relationships of the multi-camera system across the temporal and spatial dimensions are shown in the figure below.
[Figure: temporal, spatial, and spatio-temporal transformations between cameras in the multi-camera system]
At the same time instant, the image of camera $i$ can be mapped to camera $j$ with the following transformation:
$$\hat{p}_i=\pi_j\big(R_{i\rightarrow j}\phi_i(p_i,\hat{d}_i)+t_{i\rightarrow j}\big)$$
Frames at different moments can also be projected to the same moment using the estimated poses, so a temporal-spatial association can be constructed at that moment:
$$\hat{p}_i^t=\pi_j\big(R_{i\rightarrow j}(\hat{R}_j^{t\rightarrow c}\phi(p_j^t,\hat{d}_j^t)+t_j^{t\rightarrow c})+t_{i\rightarrow j}\big)$$
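Below is a minimal sketch of how the spatial and temporal-spatial warps above could be composed, assuming 4x4 homogeneous transforms for the calibrated extrinsics and the predicted ego-motion; the tensor shapes and helper names are illustrative, not taken from the paper:

```python
import torch

def unproject(depth, K_inv, pix_homog):
    # phi(p, d): lift pixels to 3D points in the source camera frame.
    # pix_homog: (3, H*W) homogeneous pixel coordinates; depth: (1, H, W).
    return (K_inv @ pix_homog) * depth.view(1, -1)          # (3, H*W)

def project(points, K):
    # pi(X): pinhole projection of 3D points back to pixel coordinates.
    pix = K @ points
    return pix[:2] / pix[2:].clamp(min=1e-6)                # (2, H*W)

def spatiotemporal_warp(depth_src, K_src, K_tgt, T_extr, T_ego, pix_homog):
    # Composes the fixed extrinsics (R_{i->j}, t_{i->j}) with the predicted
    # ego-motion of the source camera, then projects into the target camera.
    # Setting T_ego to the identity recovers the purely spatial warp.
    pts = unproject(depth_src, torch.inverse(K_src), pix_homog)
    pts_h = torch.cat([pts, torch.ones_like(pts[:1])], dim=0)  # (4, H*W)
    warped = (T_extr @ T_ego @ pts_h)[:3]
    return project(warped, K_tgt)
```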
Using this relationship lets the overlapping areas contribute more valid pixels, as shown in the figure below (the last row is the valid region obtained by the temporal-spatial method):
[Figure: valid reprojection regions under temporal, spatial, and temporal-spatial warping]

Pose constraints between cameras

Although each camera in the algorithm predicts its pose independently, the cameras are all rigidly mounted on the same moving platform, so their pre-calibrated extrinsic transformations can still be used to build constraints between them. For two adjacent cameras, the corresponding temporal and spatial constraint is:
$$\bar{X}_i^{t\rightarrow t+1}=X_j^{-1}X_i\hat{X}_i^{t\rightarrow t+1}X_i^{-1}X_j$$
The above formula establishes the transformation relationship between different cameras in time and space, but note that, as written, it is problematic. The intended principle is: take the poses predicted by the surrounding cameras, transform them into the target camera's frame through the calibrated extrinsics, and then constrain the target camera's own pose estimate against these transformed poses, i.e. make them agree in both the translation and rotation components.
On the translation component:
$$t_{loss}=\sum_{j=2}^N\|\hat{t}_1^{t+1}-\bar{t}_j^{t+1}\|^2$$
On the rotation component:
$$R_{loss}=\sum_{j=2}^N\left(\|\hat{\phi}_1-\bar{\phi}_j\|^2+\|\hat{\theta}_1-\bar{\theta}_j\|^2+\|\hat{\Phi}_1-\bar{\Phi}_j\|^2\right)$$
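A minimal sketch of these two consistency terms, assuming the rotations have already been converted to Euler angles and that the bar quantities are the other cameras' predicted poses transformed into the target camera's frame via the calibrated extrinsics (variable names are illustrative):

```python
import torch

def pose_consistency_loss(t_hat_1, euler_hat_1, transformed_poses):
    # t_hat_1: (3,) predicted translation of the target camera.
    # euler_hat_1: (3,) predicted rotation of the target camera as Euler angles.
    # transformed_poses: list of (t_bar_j, euler_bar_j) for cameras j = 2..N,
    #   i.e. their predicted poses expressed in the target camera's frame.
    t_loss = sum(torch.sum((t_hat_1 - t_bar) ** 2)
                 for t_bar, _ in transformed_poses)
    r_loss = sum(torch.sum((euler_hat_1 - e_bar) ** 2)
                 for _, e_bar in transformed_poses)
    return t_loss, r_loss
```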

Mask for loss calculation

Two types of masks are used in the loss computation: a non-overlap mask and a self-occlusion mask. The first is determined by the valid region of the reconstruction: the photometric reconstruction error is computed under the guidance of this mask in both the temporal case (different moments of the same camera) and the spatial case (different cameras at the same moment).

The second type is determined by the mounting position of the rig itself (parts of the ego vehicle visible in each camera's image); these regions are excluded from the loss computation, with the effect shown in the figure below:
[Figure: self-occlusion masks excluding the ego vehicle from the loss]
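As a rough sketch of how both masks could enter the loss (the normalized-coordinate convention follows torch.nn.functional.grid_sample; the per-camera self-occlusion mask is assumed to be given, not computed here):

```python
import torch

def masked_photometric_loss(loss_map, warped_pix_norm, self_occlusion_mask):
    # loss_map: (B, 1, H, W) per-pixel photometric error.
    # warped_pix_norm: (B, H, W, 2) warped coordinates normalized to [-1, 1];
    #   pixels falling outside this range have no valid correspondence in the
    #   other view (the non-overlap mask).
    # self_occlusion_mask: (B, 1, H, W), 0 where the ego vehicle blocks the view.
    in_view = (warped_pix_norm.abs() <= 1).all(dim=-1).unsqueeze(1).float()
    valid = in_view * self_occlusion_mask
    return (loss_map * valid).sum() / valid.sum().clamp(min=1.0)
```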

Experimental results

KITTI dataset:
[Figure: quantitative results on the KITTI dataset]
DDAD dataset:
[Figure: quantitative results on the DDAD dataset]

Origin blog.csdn.net/m_buddy/article/details/131997676