Reference code: None
Introduction
Depth estimation is a fundamental environmental-perception task: 3D perception built on top of it can be more accurate and generalize better. Monocular self-supervised depth estimation already has classic models such as MonoDepth and ManyDepth. This article explores multi-camera self-supervised depth estimation: on top of the monocular self-supervised formulation, mutual constraints between the cameras are used to build a multi-camera self-supervised method. Specifically, each camera predicts its own depth map and pose, and a self-supervised loss is constructed from the overlap between camera views, the ego-motion poses, and the depth estimates. The multi-camera views are mainly used to constrain the poses predicted from the individual viewpoints, and it is precisely the use of the extrinsics that gives the network the ability to perceive real (metric) distance.
Method design
Self-supervised depth estimation using spatial-temporal contexts
In monocular depth estimation, the photometric reconstruction error is built from images captured at different time steps; its typical form is:
$$L_p(I_t,\hat{I}_t)=\alpha\,\frac{1-\mathrm{SSIM}(I_t,\hat{I}_t)}{2}+(1-\alpha)\,\|I_t-\hat{I}_t\|$$
where $\hat{I}_t$ is obtained by warping with the estimated pose and depth. The warp process is written as:
$$\hat{p}^t=\pi\big(\hat{R}^{t\rightarrow c}\,\phi(p^t,\hat{d}^t,K)+\hat{t}^{t\rightarrow c},\,K\big)$$
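A minimal numpy sketch of this warp and the photometric loss (the function names, the simplified pinhole model, and passing SSIM in as a precomputed scalar are my own assumptions for illustration, not the article's code):

```python
import numpy as np

def backproject(p, d, K):
    """phi: lift pixels p (N,2) with depths d (N,) to 3D points in the camera frame."""
    p_h = np.concatenate([p, np.ones((p.shape[0], 1))], axis=1)  # homogeneous pixels
    return (np.linalg.inv(K) @ p_h.T).T * d[:, None]

def project(P, K):
    """pi: project 3D points P (N,3) back to pixel coordinates (N,2)."""
    q = (K @ P.T).T
    return q[:, :2] / q[:, 2:3]

def temporal_warp(p, d, R, t, K):
    """p_hat = pi(R * phi(p, d, K) + t, K): warp pixels from frame t to the context frame c."""
    return project((R @ backproject(p, d, K).T).T + t, K)

def photometric_loss(I, I_hat, ssim_val, alpha=0.85):
    """L_p = alpha * (1 - SSIM)/2 + (1 - alpha) * |I - I_hat| (SSIM precomputed for brevity)."""
    return alpha * (1.0 - ssim_val) / 2.0 + (1.0 - alpha) * np.abs(I - I_hat).mean()
```

With an identity pose the warp maps every pixel to itself, which is a quick sanity check on the projection chain.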
In a multi-camera system, besides using temporal information as in the monocular case, spatial information can also be introduced (adjacent cameras of the rig share substantial view overlap), and the two can be mixed (the current frame undergoes a spatial warp after the temporal warp). The transformation relationships of the multi-camera system across the temporal and spatial dimensions are shown in the figure below.
For the same time instant, the image of camera $i$ is mapped to camera $j$ by the following transformation:
$$\hat{p}_i=\pi_j\big(R_{i\rightarrow j}\,\phi_i(p_i,\hat{d}_i)+t_{i\rightarrow j}\big)$$
Frames at different moments can also be projected to the same moment via the estimated poses, so a temporal-spatial association can be constructed at the same instant:
$$\hat{p}_i^t=\pi_j\big(R_{i\rightarrow j}(\hat{R}_j^{t\rightarrow c}\,\phi(p_j^t,\hat{d}_j^t)+t_j^{t\rightarrow c})+t_{i\rightarrow j}\big)$$
Using this relationship yields more valid pixels in the overlapping regions, as shown in the figure below (the last row is the valid region obtained by the temporal-spatial method):
Pose constraints between cameras
Each camera in the algorithm predicts its pose independently, yet all the cameras belong to the same rigid motion system, so their pre-calibrated extrinsic transformations can be used to construct constraints between the predictions. For two adjacent cameras, the corresponding temporal-spatial constraint is:
$$\bar{X}_i^{t\rightarrow t+1}=X_j^{-1}X_i\hat{X}_i^{t\rightarrow t+1}X_i^{-1}X_j$$
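This extrinsic conjugation can be written in a few lines with 4×4 homogeneous transforms (function names are my own; a sketch, not the article's implementation):

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R (3,3) and translation t (3,)."""
    X = np.eye(4)
    X[:3, :3] = R
    X[:3, 3] = t
    return X

def transfer_motion(X_hat_i, X_i, X_j):
    """bar{X}_i^{t->t+1} = X_j^{-1} X_i Xhat_i^{t->t+1} X_i^{-1} X_j:
    re-express camera i's predicted ego-motion in camera j's frame via the extrinsics."""
    return np.linalg.inv(X_j) @ X_i @ X_hat_i @ np.linalg.inv(X_i) @ X_j
```

A useful property to check: when the predicted motion is the identity (the rig does not move), the transferred motion is also the identity, regardless of the extrinsics.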
The formula above relates the poses of different cameras across time and space, but note that it is problematic as written. The intended principle is: take the poses predicted by the surrounding cameras, transform them to the target camera through the calibrated extrinsics, and then constrain the target camera's own pose estimate against the transformed poses, separately on the translation and rotation components, making them agree. On the translation component:
$$t_{loss}=\sum_{j=2}^{N}\big\|\hat{t}_1^{t+1}-\bar{t}_j^{t+1}\big\|^2$$
On the rotation component:
$$R_{loss}=\sum_{j=2}^{N}\|\hat{\phi}_1-\bar{\phi}_j\|^2+\|\hat{\theta}_1-\bar{\theta}_j\|^2+\|\hat{\Phi}_1-\bar{\Phi}_j\|^2$$
where $\phi,\theta,\Phi$ are the Euler-angle components of the predicted and transferred rotations (the source writes the transferred terms with subscript 1; subscript $j$ is used here for consistency with the translation loss).
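The two consistency losses amount to summed squared differences over the surrounding cameras; a minimal sketch (function name and the Euler-angle representation of rotations are my assumptions):

```python
import numpy as np

def pose_consistency_losses(t_hat_1, euler_hat_1, t_bars, euler_bars):
    """Sum squared differences between camera 1's predicted ego-motion and the
    motions transferred from the surrounding cameras j = 2..N.
    t_hat_1: (3,) translation; euler_hat_1: (3,) Euler angles;
    t_bars / euler_bars: lists of transferred translations / Euler angles."""
    t_loss = sum(float(np.sum((t_hat_1 - tb) ** 2)) for tb in t_bars)
    r_loss = sum(float(np.sum((euler_hat_1 - eb) ** 2)) for eb in euler_bars)
    return t_loss, r_loss
```

When all cameras agree the losses vanish; any discrepancy in a single transferred pose contributes its squared norm.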
Masks for loss calculation
Two types of masks are used in the loss calculation: a non-overlap mask and a self-occlusion mask. The first is determined by the valid area of the reconstruction: the photometric reconstruction error is computed under this mask in both the temporal case (same camera, different times) and the spatial case (different cameras, same time).
The second is determined by the parts of the ego vehicle visible to each camera; those regions are excluded from the loss computation. The effect is shown in the figure below:
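The two masks can be sketched as follows: a validity mask from the warped coordinates, and a precomputed boolean self-occlusion mask that is simply excluded from the averaged error (names and the per-pixel error layout are illustrative assumptions):

```python
import numpy as np

def reprojection_valid_mask(p_hat, H, W):
    """Non-overlap mask: True where warped pixel coordinates land inside the target image."""
    x, y = p_hat[:, 0], p_hat[:, 1]
    return (x >= 0) & (x <= W - 1) & (y >= 0) & (y <= H - 1)

def masked_mean_error(err, valid, self_occluded):
    """Average the per-pixel error only over valid, non self-occluded pixels."""
    m = valid & ~self_occluded
    return float(err[m].mean()) if m.any() else 0.0
```

Pixels that warp outside the target view or fall on the vehicle body therefore contribute nothing to the photometric loss, which prevents those regions from dragging the depth estimate toward degenerate solutions.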
Experimental results
KITTI dataset:
DDAD dataset: