Characterizing and Improving Stability in Neural Style Transfer

  This paper was published by researchers at Stanford University at CVPR 2017. Its main topic is how to keep video style transfer temporally stable.

Original link: https://arxiv.org/abs/1705.02092


Summary

  Recent research on image style transfer has focused on improving the quality of the synthesized images and the speed of the algorithms. However, the real-time methods are highly unstable and show noticeable flickering when applied to video. In this paper, we characterize the instability of these methods by examining the solution set of the style transfer objective. We show that the trace of the Gram matrix of the style image is inversely related to the stability of the method. Based on this, we propose a recurrent convolutional network with a temporal consistency loss that overcomes the instability of prior methods. Our network can be applied at any resolution, does not require optical flow at test time, and produces high-quality, temporally consistent stylized video in real time.

1. Introduction

  Artistic style transfer combines the content of one image with the style of another to synthesize a new image. The problem has recently been revisited with deep neural network techniques. Subsequent work has improved speed and quality, and has shown that multiple styles can be modeled with a single network.
  Recent image style transfer methods fall into two categories. Optimization-based methods solve an optimization problem for each synthesized image; they produce high-quality results but are slow. Feed-forward methods train a neural network to approximate the solution of the optimization problem; once trained, they can be applied in real time. However, all of these methods are highly unstable and show visible flickering when applied to video, as shown in Figure 1. Ruder et al. extended the optimization-based approach from images to video. Their method produces high-quality stylized video but is far too slow for real-time use.
  In this paper, our goal is to perform video style transfer with a feed-forward method while matching the quality of optimization-based methods. Recent style transfer methods represent image style with Gram matrices of features: stylized images are synthesized by matching the Gram matrix of the style image. We find that the trace of the style image's Gram matrix is closely related to pixel instability. In particular, the solution set of the Gram matrix matching objective is a sphere whose radius is determined by the trace of the style image's Gram matrix. Because this objective is non-convex, small changes in the content image can push the synthesized image toward a different solution of the Gram matrix matching objective. If all solutions are close together (small trace), the synthesized images corresponding to different solutions are also similar (little instability). If the solutions are far apart (large trace), different solutions yield very different synthesized images (severe instability).
  Based on this insight, we propose a method that greatly improves the stability of feed-forward style transfer and enables the synthesis of high-quality stylized video. Specifically, we use a recurrent convolutional network for video stylization, trained with a temporal consistency loss that encourages the network to find similar solutions of the Gram matrix matching objective at adjacent time steps. Our contributions are twofold:

  • First, we characterize the instability of recent style transfer methods by examining the solution set of the style transfer objective, showing that the trace of the style image's Gram matrix and the stability of the method are anticorrelated. This characterization applies to all neural style transfer methods based on Gram matrix matching.
  • Second, we propose a recurrent convolutional network for real-time video stylization that overcomes the instability of previous methods. Inspired by Ruder et al., we incorporate an optical-flow-based loss that encourages the network to produce temporally consistent results. Our method combines the speed of feed-forward methods with the temporal stability of optimization-based methods, stylizing video roughly 1000x faster without sacrificing quality.

2. Related work

  Texture synthesis. Texture synthesis is closely related to style transfer: its goal is to infer a generative process from an input texture and then generate new samples with the same texture. Early computer vision approaches to texture synthesis fall into two distinct classes: parametric and non-parametric. Parametric methods compute global statistics in a feature space and sample images directly from it. Non-parametric methods estimate a local conditional probability density and synthesize pixels incrementally, resampling pixels or entire regions of the original texture.
  Parametric methods build on the Julesz characterization of textures: two images are said to have the same texture if they have similar statistical measurements in some feature space. The work of Gatys et al. builds on Portilla and Simoncelli, using the feature space of a high-performing convolutional neural network and the Gram matrix as the summary statistic. Ulyanov et al. propose instance normalization and a new learning formulation to address perceptual quality in feed-forward texture synthesis, which encourages the generator to sample unbiasedly from the Julesz texture ensemble. Building on texture transfer, Chen and Schmidt propose a novel "style swap" approach that replaces patches of content image activations with the best-matching style activation patches to create the output activations; the swapped activations are then passed to an inverse network to generate the stylized image. Their optimization formulation is much more stable than [22, 34], making it particularly suitable for video. Although their method generalizes well and is stable, it takes a few seconds per frame and cannot be used for real-time video style transfer.
  The work of Gatys et al. shows that the feature representations of high-performing convolutional neural networks can be used to generate high-quality images. Their optimization-based method gives visually pleasing results but is computationally expensive. Johnson et al. and Ulyanov et al. proposed feed-forward networks that are thousands of times faster than [12] and can generate stylized images in real time. However, each style requires training a separate feed-forward network, and the visual quality of the generated images is inferior to that of optimization-based methods. Dumoulin et al. propose conditional instance normalization layers to address this problem, allowing a single network to learn multiple styles. Compared with [23, 43], this simple and effective model learns arbitrarily different styles with fewer parameters without compromising speed or visual quality.
  Optical flow. Accurate optical flow estimation is an active topic in computer vision research with broad practical applications. Classic approaches are variational methods built on the framework of Horn and Schunck. Convolutional neural networks (CNNs) have been shown to be competitive with state-of-the-art optical flow estimation algorithms. FlowNet introduces a CNN architecture for optical flow estimation with performance comparable to variational methods. A complete review of optical flow estimation is beyond the scope of this paper; interested readers are referred to [1, 33, 3].
  Style transfer for video. Traditional artistic stylization of images is studied under the label of non-photorealistic rendering. Litwinowicz was the first to combine painting Impressionist brushstrokes onto images with optical flow to track pixels between video frames, producing temporally coherent output video. Hays and Essa added further optical and spatial constraints to overcome flickering brushstrokes. Hertzmann improved visual quality with painting techniques using multiple brush sizes and long curved strokes [18], and this work was later extended to video.
  Ruder et al. recently extended the optimization-based method of [12] by introducing optical-flow-based constraints that enforce temporal consistency between adjacent frames. They also proposed a multi-pass algorithm to ensure long-term consistency across the video. Their results are excellent in terms of temporal consistency and per-frame visual quality, but processing a single frame takes several minutes.

3. Stability of style transfer

3.1 Image style transfer

  We use the style transfer formulation of [12], which we briefly review below. Style transfer is an image synthesis technique that takes a content image $c$ and a style image $s$ as input, and produces an output image $p$ that minimizes

$$\mathcal{L}(s,c,p)=\lambda_c \mathcal{L}_c(p,c)+\lambda_s\mathcal{L}_s(p,s) \tag{1}$$

where $\mathcal{L}_c$ and $\mathcal{L}_s$ are the content reconstruction loss and style reconstruction loss, respectively, and $\lambda_c$, $\lambda_s$ are scalar hyperparameters controlling their relative importance.
  The content and style reconstruction losses are defined in terms of a convolutional neural network $\phi$; we use VGG-19 pre-trained on ImageNet. Let $\phi_j(x)$ denote the activations of image $x$ at the $j^{th}$ layer of the network, with shape $C_j \times H_j \times W_j$. Given a set of content layers $\mathcal{C}$ and style layers $\mathcal{S}$, the content and style reconstruction losses are defined as

$$\mathcal{L}_c(p,c)=\sum_{j \in \mathcal{C}}\frac{1}{C_jH_jW_j}\|\phi_j(p)-\phi_j(c)\|^2_2 \tag{2}$$

$$\mathcal{L}_s(p,s)=\sum_{j \in \mathcal{S}}\frac{1}{C_jH_jW_j}\|G(\phi_j(p))-G(\phi_j(s))\|^2_F \tag{3}$$

where $G(\phi_j(x))$ is the $C_j \times C_j$ Gram matrix of the layer-$j$ activations, $G(\phi_j(x))=\Phi_{jx}\Phi_{jx}^T$, and $\Phi_{jx}$ is the $C_j \times H_jW_j$ matrix whose columns are the $C_j$-dimensional feature vectors of $\phi_j(x)$.
  Rather than forcing the pixels of the output image to match the content and style images, the content and style reconstruction losses encourage the output to match the high-level features of the content image and the feature correlations of the style image.
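
  As a concrete illustration of Equations (2) and (3), the sketch below computes the Gram matrix and the two per-layer losses in PyTorch. It assumes the activations $\phi_j(\cdot)$ have already been extracted from a loss network (e.g. VGG); the function names and shapes are my own, not the authors' code.

```python
import torch

def gram_matrix(phi):
    # phi: activations of shape (C, H, W) from one layer of the loss network
    C, H, W = phi.shape
    Phi = phi.reshape(C, H * W)      # C x HW matrix of feature vectors
    return Phi @ Phi.t()             # C x C Gram matrix, G = Phi Phi^T

def content_loss_layer(phi_p, phi_c):
    # Eq. (2), single layer: normalized squared L2 distance between activations
    C, H, W = phi_p.shape
    return ((phi_p - phi_c) ** 2).sum() / (C * H * W)

def style_loss_layer(phi_p, phi_s):
    # Eq. (3), single layer: normalized squared Frobenius distance between Gram matrices
    C, H, W = phi_p.shape
    return ((gram_matrix(phi_p) - gram_matrix(phi_s)) ** 2).sum() / (C * H * W)

# Example with random activations standing in for phi_j(p), phi_j(c), phi_j(s):
phi_p, phi_c, phi_s = (torch.randn(64, 128, 128) for _ in range(3))
print(gram_matrix(phi_p).shape)      # torch.Size([64, 64])
print(content_loss_layer(phi_p, phi_c).item(), style_loss_layer(phi_p, phi_s).item())
```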

3.2 Gram matrix and style stability

  As shown in Figure 1, small changes in the content image $c$ can result in quite different stylized images $p$. But we notice that not all styles share this instability. Some styles, such as Composition XIV (see Figure 1), are quite unstable; others, such as The Great Wave (see Figure 9), are more stable.

To study how much the instability depends on the style image, we consider the style loss for a single layer only (the subscript $j$ is omitted for simplicity). The style transfer network then minimizes the objective

$$\begin{aligned} \min_{G(\phi(p))}\;&\frac{1}{CHW}\|G(\phi(p))-G(\phi(s))\|_F^2 \\ \min_{\Phi_p}\;&\|\Phi_p\Phi_p^T - \Phi_s\Phi_s^T\|^2_F \end{aligned}\tag{4}$$

As intuition, first consider the simple case $C = H = W = 1$, in which Equation (4) reduces to $(\Phi_p^2-\Phi_s^2)^2$, a non-convex function whose minima lie at $\Phi_p=\pm \Phi_s$, as shown in Figure 2 (left). Similarly, when $C = H = 1, W = 2$, as shown in Figure 2 (right), the minima lie on a circle of radius $\|\Phi_s\|$. In both cases the minima lie at distance $\|\Phi_s\|_F$ from the origin. This observation also holds in the general case:

  • Theorem 1. Let $\gamma$ be the sphere centered at the origin with radius $\operatorname{tr}(\Phi_s\Phi_s^T)^{\frac{1}{2}}$. Then $J(\Phi_p)=\|\Phi_p\Phi_p^T-\Phi_s\Phi_s^T\|^2_F$ attains its minimum if and only if $\Phi_p \in \gamma$.

  This result implies that styles whose Gram matrix trace $\operatorname{tr}(\Phi_s\Phi_s^T)$ is large will be more unstable, because as $\operatorname{tr}(\Phi_s\Phi_s^T)$ grows, the solutions of the style reconstruction loss can lie far apart in feature space.
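
  The sketch below (my own numerical check, not from the paper) illustrates this for one layer: applying any orthogonal transform to $\Phi_s$ leaves the Gram matrix unchanged, so the resulting $\Phi_p$ is also a minimizer of the style objective, and its Frobenius norm equals $\operatorname{tr}(\Phi_s\Phi_s^T)^{1/2}$, i.e. such minimizers lie on the sphere $\gamma$ of Theorem 1.

```python
import torch

C, HW = 4, 10
Phi_s = torch.randn(C, HW)
G_s = Phi_s @ Phi_s.t()                           # style Gram matrix

# A random HW x HW orthogonal matrix Q; Phi_p = Phi_s Q has the same Gram matrix.
Q, _ = torch.linalg.qr(torch.randn(HW, HW))
Phi_p = Phi_s @ Q                                 # another minimizer of the style objective

print(torch.allclose(Phi_p @ Phi_p.t(), G_s, atol=1e-4))    # True: Gram matrices match
print(Phi_p.norm().item(), torch.trace(G_s).sqrt().item())  # both equal tr(Phi_s Phi_s^T)^(1/2)
```

  The larger the trace, the larger this sphere, and the farther apart two such minimizers can be, which is exactly the instability mechanism described above.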
  We empirically validate the relationship between style transfer stability and $\operatorname{tr}(\Phi_s\Phi_s^T)$ on a set of videos of static scenes, in which the only differences between frames are small illumination changes and sensor noise.

  We trained a separate feed-forward style transfer model for each of 12 styles on the COCO dataset using the algorithm of [22], and then stylized every frame of the static-scene videos with each model. Because the input videos are static, any difference between stylized frames is due to the instability of the style transfer model; we estimate the instability of each style as the mean squared error between adjacent stylized frames. Figure 3 plots the instability of each style against the trace of its Gram matrix at the VGG-16 relu1_1 and relu2_1 layers. These results show a clear correlation between style instability and Gram matrix trace.
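
  For reference, the instability measurement and the Gram trace reduce to a few lines; the tensor shapes below are assumptions about how one might store the data, not the authors' evaluation code.

```python
import torch

def instability(stylized_frames):
    # stylized_frames: (T, 3, H, W), one style model applied to a static-scene video.
    # Any change between frames is due to the model, so the mean squared error
    # between adjacent stylized frames measures the style's instability.
    diffs = stylized_frames[1:] - stylized_frames[:-1]
    return (diffs ** 2).mean().item()

def gram_trace(phi):
    # tr(Phi Phi^T) for activations phi of shape (C, H, W), e.g. VGG-16 relu1_1:
    # the trace of the Gram matrix equals the squared Frobenius norm of Phi.
    return (phi ** 2).sum().item()
```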

4. Approach: Stable Style Transfer

  In summary, feed-forward networks for real-time style transfer can produce unstable stylized videos when the Gram matrix trace is large. We now propose a feed-forward style transfer method that overcomes this problem and achieves the speed of [22] and the stability of [30] simultaneously.

4.1 Overall structure

  The input to our method is a sequence of content images $c_1,\dots,c_T$ and a style image $s$; the output is a sequence of stylized images $p_1,\dots,p_T$. Each output $p_t$ should share content with $c_t$, share style with $s$, and be similar in appearance to $p_{t-1}$. At each time step, the output image $p_t$ is produced by the style transfer network $f_W$: $p_t=f_W(p_{t-1}, c_t)$.
  Similar to [22, 34], we train one network per style image $s$. The network is trained to minimize the sum of three losses over all time steps:

$$\mathcal{L}(W,c_{1:T},s)=\sum_{t=1}^T\left(\lambda_c\mathcal{L}_c(p_t,c_t)+\lambda_s\mathcal{L}_s(p_t,s)+\lambda_t\mathcal{L}_t(p_t,p_{t-1})\right)\tag{5}$$

where $\mathcal{L}_c$ and $\mathcal{L}_s$ are the content and style reconstruction losses from Section 3, and $\mathcal{L}_t$ is the temporal consistency loss, which prevents the network output from changing drastically between adjacent time steps. The scalars $\lambda_c,\lambda_s,\lambda_t$ are hyperparameters weighting the three losses. The network $f_W$ is trained with stochastic gradient descent on video sequences $\{c_{1:T}\}$ to minimize Equation (5).
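
  A schematic PyTorch training step for Equation (5) is sketched below. The loss callables and the choice to bootstrap the recurrence with the first content frame are assumptions for illustration; the paper's actual training details may differ.

```python
import torch

def train_clip(f_W, optimizer, frames, losses, lambdas):
    """One step of Eq. (5) over a clip c_1..c_T, with backpropagation through time.

    frames:  tensor (T, 1, 3, H, W) of content frames
    losses:  dict of callables: 'content'(p_t, c_t), 'style'(p_t), 'temporal'(p_t, p_prev, t)
    lambdas: dict of scalar weights 'c', 's', 't'
    """
    optimizer.zero_grad()
    total = frames.new_zeros(())
    p_prev = frames[0]                  # bootstrap the recurrence with the first content frame
    for t in range(frames.shape[0]):
        c_t = frames[t]
        p_t = f_W(p_prev, c_t)          # p_t = f_W(p_{t-1}, c_t)
        total = total + lambdas['c'] * losses['content'](p_t, c_t) \
                      + lambdas['s'] * losses['style'](p_t) \
                      + lambdas['t'] * losses['temporal'](p_t, p_prev, t)
        p_prev = p_t
    total.backward()                    # backpropagation through time over the whole clip
    optimizer.step()
    return total.item()
```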

4.2 Style transfer network

  If our network is to generate temporally consistent output, frames cannot be processed independently; the network must be able to examine its previous output to enforce consistency. Our network therefore takes the current content image $c_t$ and the stylized result of the previous frame $p_{t-1}$ as input and synthesizes $p_t = f_W(p_{t-1},c_t)$. As shown in Figure 4, the output of the network at each time step is fed back as input at the next time step. $f_W$ is thus a recurrent convolutional network and must be trained with backpropagation through time.
  The two inputs of $f_W$ are concatenated along the channel dimension. $f_W$ consists of two downsampling convolution layers, followed by several residual blocks and two nearest-neighbor upsampling layers each followed by a convolution. All convolutional layers are followed by instance normalization and a ReLU nonlinearity.
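
  The sketch below shows one plausible realization of this architecture in PyTorch; the exact filter counts, kernel sizes, and number of residual blocks are guesses in the spirit of [22], not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    pad = kernel // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, pad),
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(inplace=True),
    )

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv_block(ch, ch, 3, 1),
                                  nn.Conv2d(ch, ch, 3, 1, 1),
                                  nn.InstanceNorm2d(ch, affine=True))
    def forward(self, x):
        return x + self.body(x)

class StyleTransferNet(nn.Module):
    """f_W(p_prev, c_t): previous stylized frame and current content frame,
    concatenated along the channel dimension (3 + 3 = 6 input channels)."""
    def __init__(self, n_res=5):
        super().__init__()
        self.down = nn.Sequential(conv_block(6, 32, 9, 1),
                                  conv_block(32, 64, 3, 2),    # downsample x2
                                  conv_block(64, 128, 3, 2))   # downsample x2
        self.res = nn.Sequential(*[ResidualBlock(128) for _ in range(n_res)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'), conv_block(128, 64, 3, 1),
            nn.Upsample(scale_factor=2, mode='nearest'), conv_block(64, 32, 3, 1),
            nn.Conv2d(32, 3, 9, 1, 4))                         # output left unbounded for simplicity
    def forward(self, p_prev, c_t):
        x = torch.cat([p_prev, c_t], dim=1)   # channel-wise concatenation of the two inputs
        return self.up(self.res(self.down(x)))
```

  Because the network is fully convolutional, it can be applied to frames of any resolution at test time.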

4.3 Temporal Consistency Loss

  Our style transfer network can examine its own previous output, but this architectural change alone is not enough to guarantee temporally consistent results. Therefore, similar to Ruder et al., we add a temporal consistency loss $\mathcal{L}_t$ to the style and content losses, which encourages temporally stable results by penalizing the network when the outputs at adjacent time steps differ greatly.
  The simplest temporal consistency loss penalizes the per-pixel difference between output images: $\mathcal{L}_t(p_{t-1},p_t)=\|p_{t-1}-p_t\|^2$. However, to generate high-quality stylized video, we do not want the stylized frames to be identical across time steps; we want the strokes, lines, and colors of each stylized frame to move to subsequent frames in a way that is consistent with the motion in the input video.
  To achieve this, our temporal consistency loss uses optical flow to ensure that changes in the output frames are consistent with changes in the input frames. Specifically, let $w=(u,v)$ be the optical flow field between the input frames $c_{t-1}$ and $c_t$. A perfect optical flow gives a correspondence between the pixels of $c_t$ and $c_{t-1}$; we want the corresponding pixels of $p_t$ and $p_{t-1}$ to match. The temporal consistency loss therefore penalizes, at every pixel coordinate $(x,y)$, the difference

$$p_{t-1}(x,y)-p_t(x+u(x,y),y+v(x,y))\tag{6}$$

In practice, we first warp the output frame $p_t$ with the optical flow to obtain $\tilde{p}_t$, and then compute per-pixel differences between $\tilde{p}_t$ and $p_{t-1}$. Using bilinear interpolation makes this warping differentiable.
  Due to the motion of foreground objects, some pixels of $c_{t-1}$ may be occluded in $c_t$; similarly, some pixels occluded in $c_{t-1}$ may become visible in $c_t$. If the temporal consistency loss were applied to all pixels of $\tilde{p}_t$ and $p_{t-1}$, it would produce unnatural artifacts at motion boundaries. We therefore use an occlusion mask $m$ to avoid applying the temporal consistency loss to occluded or disoccluded pixels. Our final temporal consistency loss is

$$\mathcal{L}_t(p_{t-1},p_t)=\frac{1}{HW}\|m_t\odot p_{t-1}-m_t \odot \tilde{p}_t\|_F^2\tag{7}$$

where $m(h,w) \in [0,1]$ takes the value 0 in occluded regions and at motion boundaries, $\odot$ denotes elementwise multiplication, and $H, W$ are the height and width of the input frame. Figure 5 summarizes the loss function.
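
  Below is a sketch of the warp and the masked loss of Equations (6) and (7), using PyTorch's `grid_sample` for the differentiable bilinear warp. The flow field and occlusion mask are assumed to be precomputed for each pair of training frames; the paper's exact flow method and mask construction are not reproduced here.

```python
import torch
import torch.nn.functional as F

def warp(p_t, flow):
    """Backward-warp p_t (N, 3, H, W) by the flow (N, 2, H, W), given in pixels,
    so that warp(p_t, flow)[:, :, y, x] samples p_t at (x + u(x, y), y + v(x, y))."""
    N, _, H, W = p_t.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=p_t.dtype, device=p_t.device),
                            torch.arange(W, dtype=p_t.dtype, device=p_t.device),
                            indexing='ij')
    grid_x = xs.unsqueeze(0) + flow[:, 0]                 # x + u(x, y)
    grid_y = ys.unsqueeze(0) + flow[:, 1]                 # y + v(x, y)
    grid = torch.stack([2 * grid_x / (W - 1) - 1,         # normalize to [-1, 1] for grid_sample
                        2 * grid_y / (H - 1) - 1], dim=-1)
    return F.grid_sample(p_t, grid, mode='bilinear', align_corners=True)

def temporal_loss(p_prev, p_t, flow, mask):
    """Eq. (7): masked per-pixel difference between p_{t-1} and the warped p_t.
    mask: (N, 1, H, W), 0 at occlusions and motion boundaries, 1 elsewhere."""
    _, _, H, W = p_t.shape
    p_tilde = warp(p_t, flow)
    return ((mask * p_prev - mask * p_tilde) ** 2).sum() / (H * W)
```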

  Computing this loss requires both optical flow and occlusion masks; however, because this loss is only applied during training, our method does not need to compute optical flow or occlusion masks during testing.

Origin blog.csdn.net/qq_16137569/article/details/84987822