Castle in the Sky: Dynamic Sky Replacement and Harmonization in Videos

Reference code: SkyAR

1. Overview

This article presents a method for automatically replacing the sky in videos. No auxiliary information such as inertial sensors is required; given only the video frames, the method performs sky segmentation, background replacement, and blending, and the quality of the final results is quite good.

The overall pipeline of the article can be roughly divided into the following three parts:

  • 1) Sky matting: borrowing the idea of image matting, the sky is separated from the background to obtain a high-quality soft alpha prediction. Compared with traditional hard segmentation, matting yields softer boundaries, so the composited result looks more natural;
  • 2) Motion estimation: the article estimates the affine transformation of the sky (translation/rotation/scaling degrees of freedom) from optical flow and tracked key points, so motion information is obtained without inertial sensors;
  • 3) Skybox: once the motion is known, the background (skybox) image is cropped at the corresponding position and then blended into the frame using the recolor and relight steps described in the article. A minimal per-frame sketch of these three stages is given below.
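To make the three stages concrete, here is a minimal, self-contained Python sketch of the per-frame loop. The stage functions (`predict_sky_matte`, `estimate_motion`, `blend_skybox`) are simplified stand-ins for illustration only, not the actual SkyAR implementation; later sections sketch each stage in more detail.

```python
import cv2
import numpy as np

# --- Stand-in stubs for the three stages (placeholders, not the SkyAR API) ---
def predict_sky_matte(frame):
    """Stage 1 stub: return a soft sky alpha in [0, 1] (here: top half of the frame)."""
    h, w = frame.shape[:2]
    alpha = np.zeros((h, w), np.float32)
    alpha[: h // 2] = 1.0
    return alpha

def estimate_motion(prev_gray, curr_gray, alpha):
    """Stage 2 stub: return a 3x3 inter-frame transform (here: identity)."""
    return np.eye(3, dtype=np.float32)

def blend_skybox(frame, skybox, alpha, M_acc):
    """Stage 3 stub: warp the skybox by the accumulated motion and alpha-blend."""
    h, w = frame.shape[:2]
    sky = cv2.warpPerspective(cv2.resize(skybox, (w, h)), M_acc, (w, h))
    a = alpha[..., None]
    return ((1 - a) * frame + a * sky).astype(np.uint8)

def process_video(video_path, skybox):
    cap = cv2.VideoCapture(video_path)
    prev_gray, M_acc, outputs = None, np.eye(3, dtype=np.float32), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        alpha = predict_sky_matte(frame)                              # 1) sky matting
        if prev_gray is not None:
            M_acc = estimate_motion(prev_gray, gray, alpha) @ M_acc   # 2) motion estimation
        outputs.append(blend_skybox(frame, skybox, alpha, M_acc))     # 3) skybox blending
        prev_gray = gray
    cap.release()
    return outputs
```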

The method's results are visually quite striking; some qualitative examples are shown below:
[Figure: qualitative sky replacement results]
The speed of the method and the time consumed by each component:
[Figure: runtime and per-component timing]

2. Method design

2.1 Overall pipeline

The overall pipeline of the article is shown in the figure below:
[Figure: overall pipeline of the method]

2.2 Sky Matting

The matting network uses ResNet-50 as the backbone, on top of which a U-shaped encoder-decoder is built. Traditional convolutions are replaced with coordinate convolutions (CoordConv), so that position information is injected into the features. This suits the spatial distribution of the sky, which generally occupies the upper part of the image (the normalized y-coordinate is particularly informative), and it leads to a better matting result. The figure below shows the effect of this type of convolution on the final result:
[Figure: effect of CoordConv on the matting result]
The network structure is shown in Table 1:
[Table 1: matting network architecture]
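As a reference, here is a minimal PyTorch sketch of a coordinate convolution layer of the kind described above, which concatenates normalized x/y coordinate channels before a standard convolution. This is a generic CoordConv, not the exact layer used in the SkyAR code.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv2d that first concatenates normalized x/y coordinate channels."""
    def __init__(self, in_ch, out_ch, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        # Normalized coordinates in [-1, 1]; the y channel encodes the vertical
        # prior (the sky tends to occupy the upper part of the image).
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

# Example: drop-in replacement for a 3x3 convolution in the decoder.
layer = CoordConv2d(64, 64, kernel_size=3, padding=1)
out = layer(torch.randn(2, 64, 48, 80))
```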
For supervising the matting network, the article uses the L2 norm and defines the loss function as:

$$L_f(I_l) = \mathbb{E}_{I_l \in D_l}\left\{ \frac{1}{N_l}\,\big\|f(I_l) - \hat{A}_l\big\|_2^2 \right\}$$

where $f(I_l)$ is the predicted sky matte for image $I_l$, $\hat{A}_l$ is the ground-truth matte, and $N_l$ is the number of pixels.
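In PyTorch terms this is just a per-pixel mean squared error between the predicted matte and the ground truth; a minimal sketch, assuming `pred` and `gt` are `(B, 1, H, W)` tensors:

```python
import torch
import torch.nn.functional as F

def matting_loss(pred, gt):
    """Per-pixel L2 loss: (1/N) * ||f(I_l) - A_hat||_2^2, averaged over the batch."""
    return F.mse_loss(pred, gt)

# Example with dummy (B, 1, H, W) mattes.
loss = matting_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```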
After the decoder produces the sky matting result, the matte needs post-processing. The article uses a guided filter to refine the matting result under the guidance of the original input; since the sky is characterized here mainly by its blue component, only the blue channel of the input is kept as the guidance image (presumably for better color contrast; there is some rationale for this, but the restriction is rather rigid). The refined alpha map can therefore be written as:

$$A = f_{gf}\big(h(A_l), I, r, \epsilon\big)$$

where the two hyperparameters of the guided filter, the radius and the regularization coefficient, are set to $r = 20$ and $\epsilon = 0.01$.
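A possible implementation of this refinement step uses the guided filter from opencv-contrib (`cv2.ximgproc.guidedFilter`), with the blue channel of the input frame as the guidance image and the hyperparameters quoted above; the actual SkyAR code may implement this step differently.

```python
import cv2
import numpy as np

def refine_matte(coarse_alpha, frame_bgr, r=20, eps=0.01):
    """Refine a coarse sky matte with a guided filter (blue-channel guidance)."""
    guide = frame_bgr[:, :, 0].astype(np.float32) / 255.0   # blue channel (OpenCV is BGR)
    alpha = coarse_alpha.astype(np.float32)                  # coarse matte, values in [0, 1]
    # Requires opencv-contrib-python; eps is the regularization coefficient.
    return cv2.ximgproc.guidedFilter(guide, alpha, r, eps)
```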

2.3 Motion Estimation

The motion estimation here refers to estimating the motion of the image. The article uses the Lucas-Kanade method on an image pyramid to obtain sparse optical flow, then tracks these sparse feature points to estimate the transformation between frames (the motion is described as a translation/rotation/scaling transform, and the transformation is fitted with RANSAC). Since the article cares mostly about changes in the sky, only feature points in the sky region are considered in this computation. When there are not enough valid feature points in the sky, a depth estimation method is used to obtain the affine transformation matrix from distant regions instead.
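A sketch of this step with OpenCV, following the description above (not the exact SkyAR implementation): detect corners, keep only those inside the sky matte, track them with pyramidal Lucas-Kanade, and fit a translation/rotation/scale (partial affine) transform with RANSAC.

```python
import cv2
import numpy as np

def estimate_sky_motion(prev_gray, curr_gray, sky_alpha, alpha_thresh=0.9):
    """Estimate the inter-frame affine transform from sky feature points."""
    # Detect corners only inside the (thresholded) sky region.
    sky_mask = (sky_alpha > alpha_thresh).astype(np.uint8) * 255
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=8,
                                  mask=sky_mask)
    if pts is None or len(pts) < 4:
        return None  # too few sky points; the paper falls back to distant regions
    # Pyramidal Lucas-Kanade sparse optical flow.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good_prev = pts[status.ravel() == 1]
    good_next = nxt[status.ravel() == 1]
    # Fit a translation/rotation/scale (4-DoF partial affine) transform with RANSAC.
    M, _ = cv2.estimateAffinePartial2D(good_prev, good_next, method=cv2.RANSAC)
    return M  # 2x3 matrix, or None if estimation failed
```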

In practice, it is found that translation and rotation are the dominant changes between two adjacent frames. Kernel density estimation (with a Gaussian kernel) over the Euclidean displacement distance is used to exclude feature points that move too little, discarding those with $P(d) < \eta = 0.1$. With this refinement of the transformation estimation, the time-varying transformation matrix is obtained as:

$$\hat{M}^{(t)} = M^{(c)} \cdot \big(M^{(t)} \cdot M^{(t-1)} \cdots M^{(1)}\big)$$

where $M^{(c)}$ is the center-crop transform of the skybox.
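A small NumPy sketch of this accumulation, lifting each 2x3 affine estimate to a 3x3 homogeneous matrix; `M_c` stands for the skybox center-crop transform and is a hypothetical variable name.

```python
import numpy as np

def to_3x3(M_2x3):
    """Lift a 2x3 affine matrix to a 3x3 homogeneous matrix."""
    return np.vstack([M_2x3, [0.0, 0.0, 1.0]])

def accumulate_motion(per_frame_transforms, M_c):
    """Compute M_hat(t) = M_c . (M(t) . M(t-1) ... M(1))."""
    M_acc = np.eye(3)
    for M_t in per_frame_transforms:      # ordered M(1), M(2), ..., M(t)
        M_acc = to_3x3(M_t) @ M_acc       # left-multiply the newest transform
    return M_c @ M_acc
```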

2.4 Image Blending

In general, the simplest fusion method is an alpha-weighted blend (where $I$ is the original frame, $B$ is the new sky background, and $A$ is the sky alpha matte):

$$Y^{(t)} = \big(1 - A^{(t)}\big) I^{(t)} + A^{(t)} B^{(t)}$$

However, this alone gives a rather unpleasing visual result, so the article proposes some improvements (note that the sky is treated as the foreground, i.e. its alpha tends to 1, while the non-sky background tends to 0). The original input image is recolored (controlled by the parameter $\alpha$) and relighted (controlled by the parameter $\beta$); the transformed input image can be described as:

$$\hat{I}^{(t)} = I^{(t)} + \alpha\big(\mu_{B(A=1)}^{(t)} - \mu_{I(A=0)}^{(t)}\big)$$

$$I^{(t)} = \beta\big(\hat{I}^{(t)} + \mu_I^{(t)} - \hat{\mu}_I^{(t)}\big)$$

where $\mu$ denotes the mean of the corresponding image (over the indicated region). The following figure shows the result obtained with this fusion method (a code sketch of the blending step is given after the figure):
[Figure: blending results with recolor and relight]
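A NumPy sketch of the blending step, combining a simplified reading of the recolor/relight formulas above with the alpha-weighted composition; `frame` is the input frame $I^{(t)}$, `sky` the warped skybox $B^{(t)}$, `alpha` the sky matte $A^{(t)}$, and `a`, `b` correspond to $\alpha$, $\beta$. This is an illustrative interpretation, not the exact SkyAR code.

```python
import numpy as np

def blend_frame(frame, sky, alpha, a=0.5, b=1.0):
    """Recolor/relight the input frame, then alpha-blend in the new sky."""
    frame = frame.astype(np.float32) / 255.0
    sky = sky.astype(np.float32) / 255.0
    alpha3 = alpha[..., None].astype(np.float32)      # (H, W, 1), sky region ~ 1

    # Recolor: shift the frame toward the color statistics of the new sky.
    mu_sky = sky[alpha > 0.5].mean(axis=0)            # mean color of B over the sky region (A=1)
    mu_fg = frame[alpha <= 0.5].mean(axis=0)          # mean color of I over the non-sky region (A=0)
    recolored = frame + a * (mu_sky - mu_fg)

    # Relight: rescale while preserving the original mean intensity.
    relit = b * (recolored + frame.mean() - recolored.mean())

    # Final alpha-weighted composition: Y = (1 - A) * I_hat + A * B.
    out = (1.0 - alpha3) * relit + alpha3 * sky
    return np.clip(out * 255.0, 0, 255).astype(np.uint8)
```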

3. Experimental results

[Figure: experimental results]

Original article: blog.csdn.net/m_buddy/article/details/112855469