[Paper Notes] SVDM: Single-View Diffusion Model for Pseudo-Stereo 3D Object Detection

Original link: https://arxiv.org/abs/2307.02270

1 Introduction

  Current methods for generating pseudo-sensor representations from monocular cameras rely on pre-trained depth estimation networks. These methods require depth labels to train the depth estimator, and pseudo-stereo methods synthesize stereo images by forward-warping the input image, which can introduce pixel artifacts, distortions, and holes in occluded areas. Furthermore, feature-level pseudo-stereo generation is difficult to apply directly and has limited adaptability.
  This raises the question: can depth estimation be bypassed entirely by designing a view generator at the image level? Compared with GANs, diffusion models have a simpler structure, fewer hyperparameters, and simpler training procedures, yet there has been no prior work on pseudo-view generation for 3D object detection.
  This paper designs a single-view diffusion model (SVDM) for pseudo-view synthesis. SVDM assumes the left-view image is known, replaces the Gaussian noise with left-image pixels, and gradually diffuses the right-image pixels across the entire image. Because the disparity between stereo views is small, good results can be produced with only a few steps. SVDM does not use depth ground truth and can be trained end-to-end.

3. Method

3.1 Preliminaries

3.1.a Stereo 3D detector

  Stereo 3D detectors fall into three categories: models trained only on stereo images (e.g., Stereo R-CNN), models that additionally require depth ground truth for training (e.g., YOLOStereo3D), and volume-grid-based models (e.g., LIGA-Stereo).

3.1.b Denoising diffusion probabilistic models (DDPM)

  For details, see the post Introduction to Diffusion Model. The goal of DDPM is to optimize the evidence lower bound (ELBO). Most conditional diffusion models keep the diffusion process unchanged and add the condition $y$ to the denoising function:
$$\mathbb{E}_{t,x_0,\epsilon}\left[\|\epsilon-\epsilon_\theta(x_t,y,t)\|_2^2\right]$$
However, since $p(x_t|y)$ does not appear explicitly in the training objective, it is difficult to ensure that the diffusion model learns the desired conditional distribution.
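  As a minimal illustration (not the paper's code), the following PyTorch sketch implements this conditional $\epsilon$-prediction objective; `eps_model` and the noise schedule `alpha_bar` are hypothetical stand-ins:

```python
import torch

def conditional_ddpm_loss(eps_model, x0, y, alpha_bar):
    """One training step of the conditional DDPM objective (sketch).

    eps_model(x_t, y, t) is a hypothetical noise-prediction network;
    alpha_bar is the cumulative product of the noise schedule, shape [T].
    """
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(B, 1, 1, 1)
    # Standard forward diffusion: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    # The condition y enters only as an extra network input, so p(x_t | y)
    # never appears explicitly in the objective -- the issue noted above.
    return ((eps - eps_model(x_t, y, t)) ** 2).mean()
```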

3.2 Single view diffusion model

  This model treats novel-view generation as an image-to-image (I2I) translation task based on a diffusion model. The method is shown in the figure below and comprises three diffusion-model variants: a Gaussian noise operator, a view image operator, and one-step generation.
(Figure: overview of the three SVDM variants.)

3.2.a Gaussian noise operator

  To learn the transformation between the two view domains, this paper follows BBDM and uses a Brownian bridge diffusion process instead of the DDPM formulation.
  The Brownian bridge process is a continuous-time stochastic process whose distribution is conditioned on both the starting and terminal states. Denote the starting state $x_0\sim q_{data}(x_0)$ and the terminal state $x_T=y$. The marginal distribution of the Brownian bridge diffusion process is
$$q_{BB}(x_t|x_0,y)=\mathcal{N}\big(x_t;(1-m_t)x_0+m_t y,\ \delta_t I\big)$$
where $m_t=t/T$ and $\delta_t$ is the variance. To keep the variance from growing so large that training becomes impossible, the following variance schedule is used:
$$\delta_t=s\big[1-\big((1-m_t)^2+m_t^2\big)\big]=2s(m_t-m_t^2)$$
where $s$ controls sample diversity and defaults to 1.
  The forward process behaves as follows: at $t=0$, $m_t=0$, so the mean is $x_0$ and the variance is 0; at $t=T$, $m_t=1$, so the mean is $y$ and the variance is 0. Intermediate states are computed as
$$x_t=(1-m_t)x_0+m_t y+\sqrt{\delta_t}\,\epsilon$$
where $\epsilon\sim\mathcal{N}(0,I)$. Substituting $t-1$ for $t$ in the formula above and combining the two equations gives the transition probability
$$q_{BB}(x_t|x_{t-1},y)=\mathcal{N}\Big(x_t;\tfrac{1-m_t}{1-m_{t-1}}x_{t-1}+\big(m_t-\tfrac{1-m_t}{1-m_{t-1}}m_{t-1}\big)y,\ \delta_{t|t-1}I\Big)$$
where
$$\delta_{t|t-1}=\delta_t-\delta_{t-1}\frac{(1-m_t)^2}{(1-m_{t-1})^2}$$
The reverse process starts from the known view and gradually recovers the distribution of the target view, i.e., it predicts $x_{t-1}$ from $x_t$:
$$p_\theta(x_{t-1}|x_t,y)=\mathcal{N}\big(x_{t-1};\mu_\theta(x_t,t),\ \tilde{\delta}_t I\big)$$
where $\mu_\theta(x_t,t)$ is the predicted mean of the noise, estimated by a neural network under the maximum-likelihood criterion, and $\tilde{\delta}_t$ is the per-step noise variance, with analytical form
$$\tilde{\delta}_t=\frac{\delta_{t|t-1}\,\delta_{t-1}}{\delta_t}$$
  The complete training and inference process is as follows:

BBDM training algorithm:

  1. Sample a data pair $x_0\sim q(x_0)$, $y\sim q(y)$
  2. Sample a timestep uniformly: $t\in\{1,2,\cdots,T\}$
  3. Sample Gaussian noise $\epsilon\sim\mathcal{N}(0,I)$
  4. Forward diffusion: $x_t=(1-m_t)x_0+m_t y+\sqrt{\delta_t}\,\epsilon$
  5. Take a gradient step on $\|m_t(y-x_0)+\sqrt{\delta_t}\,\epsilon-\epsilon_\theta(x_t,t)\|^2$
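  A minimal PyTorch sketch of this training step, assuming a hypothetical noise-prediction network `eps_model(x_t, t)`:

```python
import torch

def bbdm_train_step(eps_model, x0, y, T=1000, s=1.0):
    """One BBDM training step following steps 1-5 above (a sketch)."""
    B = x0.shape[0]
    t = torch.randint(1, T + 1, (B,), device=x0.device)      # step 2
    m_t = (t.float() / T).view(B, 1, 1, 1)                   # m_t = t / T
    delta_t = 2 * s * (m_t - m_t ** 2)                       # variance schedule
    eps = torch.randn_like(x0)                               # step 3
    x_t = (1 - m_t) * x0 + m_t * y + delta_t.sqrt() * eps    # step 4
    target = m_t * (y - x0) + delta_t.sqrt() * eps           # step 5 target
    return ((target - eps_model(x_t, t)) ** 2).mean()
```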

BBDM sampling algorithm:

  1. Sample the conditional input $x_T=y\sim q(y)$
  2. Starting from $t=T$, repeat the following until $t=1$:
      sample $z\sim\mathcal{N}(0,I)$
      compute $x_{t-1}=c_{xt}x_t+c_{yt}y-c_{\epsilon t}\epsilon_\theta(x_t,t)+\sqrt{\tilde{\delta}_t}\,z$
  3. At $t=1$, compute $x_0=c_{x1}x_1+c_{y1}y-c_{\epsilon 1}\epsilon_\theta(x_1,1)$
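  A sketch of the corresponding sampling loop, assuming a hypothetical helper `coeffs(t)` that returns the analytic posterior coefficients $(c_{xt}, c_{yt}, c_{\epsilon t}, \tilde{\delta}_t)$; their closed forms are given in the BBDM paper but not reproduced in these notes:

```python
import torch

@torch.no_grad()
def bbdm_sample(eps_model, y, coeffs, T=1000):
    """BBDM ancestral sampling (a sketch, not the paper's code)."""
    x_t = y.clone()                          # step 1: x_T = y
    for t in range(T, 0, -1):                # step 2: t = T, ..., 1
        c_xt, c_yt, c_eps, var = coeffs(t)   # assumed analytic coefficients
        tt = torch.full((y.shape[0],), t, device=y.device)
        mean = c_xt * x_t + c_yt * y - c_eps * eps_model(x_t, tt)
        if t > 1:
            x_t = mean + var ** 0.5 * torch.randn_like(y)
        else:
            x_t = mean                       # step 3: no noise at t = 1
    return x_t
```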

3.2.b View image operator

  The Brownian bridge diffusion model introduces additional hyperparameters, so this paper also proposes a method based on view image operators, which treats the target image as a special kind of noise and iteratively converts the target image into the source image. Given the initial state $x_0$ and target state $y$, the intermediate state $x_t$ can be written as
$$x_t=\sqrt{\alpha_t}\,x_0+\sqrt{1-\alpha_t}\,y$$
Unlike the conventional noising process, what is added here is the new-view image with gradually increasing weight. The sampling process is as follows (see the sketch after the $\alpha_t$ schedule below):

  1. Input the source image $x_T$
  2. Starting from $t=T$, repeat the following until $t=0$:
       $x_0\leftarrow f(x_t,t)$
       $x_{t-1}=x_t-D(x_0,t)+D(x_0,t-1)$

(Regarding this sampling algorithm: the notation in the original paper appears to be inconsistent and is not explained; one can only guess that the $s$ and $i$ in the original paper should actually be $t$.)

  The schedule for $\alpha_t$ is
$$\alpha_t=\frac{f(t)}{f(0)},\qquad f(t)=\cos\left(\frac{t/T+s}{1+s}\cdot\frac{\pi}{2}\right)^2$$
Compared with a linear schedule, the cosine schedule blends in the target view more slowly.
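  A sketch under stated assumptions: `f_model` is the restoration network estimating $x_0$, $D(x_0,t)$ is assumed to be the mixing operator $\sqrt{\alpha_t}x_0+\sqrt{1-\alpha_t}y$ implied by the intermediate-state formula (the notes do not define it explicitly), and the cosine offset value is a common default rather than one stated in the paper:

```python
import math
import torch

def alpha(t, T, s=0.008):
    """Cosine schedule alpha_t = f(t)/f(0); s = 0.008 is assumed
    (the common choice from Nichol & Dhariwal), not given in the notes."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def degrade(x0, y, t, T):
    # Assumed form of D(x0, t): re-mix the current x0 estimate with the
    # known view y per the schedule, sqrt(a_t) x0 + sqrt(1 - a_t) y.
    a = alpha(t, T)
    return math.sqrt(a) * x0 + math.sqrt(1 - a) * y

@torch.no_grad()
def view_operator_sample(f_model, x_T, y, T=50):
    """Sampling loop from the list above; f_model(x_t, t) estimates x0."""
    x_t = x_T
    for t in range(T, 0, -1):
        x0_hat = f_model(x_t, t)                       # x0 <- f(x_t, t)
        x_t = x_t - degrade(x0_hat, y, t, T) + degrade(x0_hat, y, t - 1, T)
    return x_t
```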

3.2.c Accelerated sampling and one-step generation

  Since diffusion probabilistic models usually require a large number of sampling steps, this paper proposes two ways to speed up inference: adding a high-order solver to guide DPM sampling, and introducing a one-step generation method.
  Accelerated sampling: following the basic idea of DDIM, BBDM can likewise use a non-Markovian process while keeping the same marginal distributions as the Markovian inference process.
  Given a subsequence $\{T_1,T_2,\cdots,T_S\}$ of $\{1,2,\cdots,T\}$ of length $S$, the inference process can be defined over the subset of latent variables $\{x_{T_1},x_{T_2},\cdots,x_{T_S}\}$:
$$q_{BB}(x_{T_{s-1}}|x_{T_s},x_0,y)=\mathcal{N}\Big((1-m_{T_{s-1}})x_0+m_{T_{s-1}}y+\frac{\sqrt{\delta_{T_{s-1}}-\sigma_{T_s}^2}}{\sqrt{\delta_{T_s}}}\big(x_{T_s}-(1-m_{T_s})x_0-m_{T_s}y\big),\ \sigma_{T_s}^2 I\Big)$$
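  A deterministic ($\sigma=0$) skip-step sampler sketched from this equation; `predict_x0`, `m`, and `delta` are hypothetical helpers exposing the quantities defined above:

```python
import torch

@torch.no_grad()
def bbdm_sample_skip(predict_x0, y, steps, m, delta, sigma=0.0):
    """Skip-step BBDM sampling (sketch). steps is the descending subsequence
    [T_S, ..., T_1]; m[t] and delta[t] are the schedule values defined above;
    predict_x0(x_t, t) is a hypothetical network reparameterised to estimate
    x0 (equivalent to predicting the noise)."""
    x = y.clone()                                      # start from x_T = y
    for t, t_prev in zip(steps[:-1], steps[1:]):
        x0_hat = predict_x0(x, t)
        resid = x - (1 - m[t]) * x0_hat - m[t] * y
        coef = ((delta[t_prev] - sigma ** 2) ** 0.5) / (delta[t] ** 0.5)
        x = (1 - m[t_prev]) * x0_hat + m[t_prev] * y + coef * resid
        if sigma > 0:
            x = x + sigma * torch.randn_like(x)
    return predict_x0(x, steps[-1])                    # final x0 estimate
```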
  One-step generation: the goal is to generate in a single step without sacrificing the advantages of iterative refinement, namely the ability to trade computation for quality and to perform zero-shot data editing. The method builds on the probability-flow ordinary differential equation (ODE) of continuous-time diffusion models, whose trajectories smoothly transform the data distribution into a tractable noise distribution. A model is trained to map any point on a trajectory back to the trajectory's starting point, so that the model is self-consistent (points on the same trajectory map to the same starting point).
  The consistency model can convert a random noise vector (the end point of the ODE trajectory, $x_T$) into a data sample (the starting point of the trajectory, $x_0$). By chaining the outputs of the consistency model over multiple steps, more computation can be spent to improve sample quality and to perform zero-shot data editing, thereby retaining the advantages of iterative refinement.

3.3 Network structure

  Following the latent diffusion model (LDM), SVDM performs generative learning in latent space rather than the original pixel space to reduce computation.
  LDM uses a pre-trained VAE encoder $E$ to map an image $v\in\mathbb{R}^{3\times H\times W}$ to a latent embedding $z=E(v)\in\mathbb{R}^{c\times h\times w}$. Its forward process gradually adds noise to $z$, and the reverse denoising process predicts $z$. Finally, LDM uses the pre-trained VAE decoder $D$ to decode $z$ into a high-resolution image $v=D(z)$. The VAE encoder and decoder remain fixed during training and inference, and since $h<H$ and $w<W$, diffusion in the low-resolution latent space is more efficient than diffusion in pixel space. The method in this paper is similar: given an image $I_A$ sampled from domain $A$, first extract its latent features $L_A$, then run the SVDM process to map $L_A$ to the corresponding latent representation $L_{A\rightarrow B}$ of domain $B$. Finally, a pre-trained VQGAN decoder generates the image $I_{A\rightarrow B}$.
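  The resulting pipeline, as a minimal sketch with all module names hypothetical:

```python
import torch

@torch.no_grad()
def pseudo_right_view(vae_encoder, svdm, vqgan_decoder, I_A):
    """Latent-space pipeline sketch: encode the left view, run the SVDM
    translation in latent space, then decode the pseudo right view.
    All three modules are assumed pre-trained; encoder/decoder are frozen."""
    L_A = vae_encoder(I_A)        # z = E(v), shape [B, c, h, w], h < H, w < W
    L_AB = svdm(L_A)              # latent-to-latent view translation
    I_AB = vqgan_decoder(L_AB)    # pre-trained VQGAN decoder
    return I_AB
```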
  The SVDM model concatenates the two images along the channel dimension and uses a standard U-Net structure with ConvNeXt residual blocks for down- and up-sampling, achieving a large receptive field and capturing contextual information. Attention blocks are also introduced at several resolutions, since global interaction significantly improves reconstruction quality.

3.4 Loss function

  The loss function comprises three terms: an RGB L1 loss, an RGB SSIM loss, and a perceptual loss.

3.4.a RGB L1 loss and SSIM loss

  The L1 and SSIM losses are
$$\mathcal{L}_{L1}=\frac{1}{3HW}\sum\big|\hat{I}_{tgt}-I_{tgt}\big|,\qquad \mathcal{L}_{ssim}=1-\mathrm{SSIM}(\hat{I}_{tgt},I_{tgt})$$
where $\hat{I}_{tgt}$ and $I_{tgt}$ are the predicted and ground-truth pixel values, respectively.
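  A PyTorch sketch of the two losses; the SSIM here uses a simplified uniform window (average pooling) rather than the usual Gaussian window, so it is an approximation:

```python
import torch
import torch.nn.functional as F

def l1_loss(pred, tgt):
    # L_L1 = (1 / 3HW) * sum |I_hat - I|, i.e. the mean absolute error
    return (pred - tgt).abs().mean()

def ssim_loss(pred, tgt, window=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """L_ssim = 1 - SSIM, with a uniform window as a simplification."""
    mu_x = F.avg_pool2d(pred, window, 1, window // 2)
    mu_y = F.avg_pool2d(tgt, window, 1, window // 2)
    var_x = F.avg_pool2d(pred * pred, window, 1, window // 2) - mu_x ** 2
    var_y = F.avg_pool2d(tgt * tgt, window, 1, window // 2) - mu_y ** 2
    cov = F.avg_pool2d(pred * tgt, window, 1, window // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return 1 - ssim.mean()
```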

3.4.b Perceptual loss

  Building on prior work, the perceptual loss keeps the reconstruction on the image manifold by enforcing local realism, avoiding the blur introduced by relying solely on RGB losses:
$$\mathcal{L}_{latent}=\frac{1}{2}\sum_{j=1}^J\big[(u_j^2+\sigma_j^2)-1-\log\sigma_j^2\big]$$
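  A one-line sketch of this term, assuming $u_j,\sigma_j$ are the per-dimension latent statistics produced by the VAE encoder:

```python
import torch

def latent_loss(mu, sigma):
    """L_latent = 0.5 * sum_j [(u_j^2 + sigma_j^2) - 1 - log sigma_j^2],
    i.e. the KL divergence of N(mu, sigma^2) from N(0, I), applied to
    latent statistics assumed to come from the VAE encoder."""
    return 0.5 * torch.sum(mu ** 2 + sigma ** 2 - 1 - torch.log(sigma ** 2))
```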

4. Experiment

4.4 View synthesis results based on a single image

  Quantitative results: the proposed method surpasses the state of the art (SotA) on PSNR, while its SSIM and LPIPS scores are slightly worse than SotA.
  Qualitative results: visualizations show that the method generates more realistic images with fewer distortions and artifacts, demonstrating its ability to model the geometry and texture of complex scenes.

4.5 3D object detection results

  Quantitative results: experiments show that SVDM with BBDM outperforms most advanced methods, and the view diffusion method improves performance further, indicating that the view structure generalizes better for 3D object detection.
  In addition, although SVDM does not surpass SotA across the board, it performs better on hard objects. The weaker performance on easy objects may stem from insufficient constraints: both background and obstacles inevitably interfere with novel-view generation. The ConvNeXt-UNet structure alleviates this problem to some extent, but not completely.

4.3 Ablation studies

  3D detection results for pedestrians and cyclists: owing to the small number of samples, detecting pedestrians and cyclists is harder than detecting cars, yet the proposed method surpasses SotA at almost all difficulty levels.

5. Conclusion and future prospects

  Currently, one drawback of SVDM is that it cannot be trained end-to-end.
