(2017, AdaIN) Real-time Arbitrary Style Transfer with Adaptive Instance Normalization

Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization


Table of contents

0. Summary

1. Introduction

2. Related work

3. Background

3.1 Batch Normalization (BN)

3.2 Instance Normalization (IN) 

3.3 Conditional Instance Normalization (CIN)

4. Interpreting instance normalization

5. Adaptive Instance Normalization (AdaIN)

6. Experimental setup

6.1 Structure

6.2 Training

7. Results

7.1 Comparison with other methods

7.2 Additional experiments

7.3 Real-time control

8. Discussion and conclusion 

Appendix

4. The effect of using AdaIN in different layers

References

S. Summary

S.1 Main idea

S.2 AdaIN

S.3 Structure and the effect of using AdaIN in different layers 


0. Summary

Gatys et al. [16] recently introduced a neural algorithm that renders the content of an image into the style of another image, enabling so-called style transfer. However, their framework requires a slow iterative optimization process, which limits its practical applications. Fast approximations using feed-forward neural networks have been proposed to accelerate neural style transfer. Unfortunately, this increased speed comes at a price: networks are usually bound to a fixed set of styles and cannot adapt to arbitrary new styles. In this paper, we propose a simple yet effective method that enables real-time arbitrary style transfer for the first time. At the heart of our approach is a novel Adaptive Instance Normalization (AdaIN) layer that aligns the mean and variance of content features with those of style features. Our method achieves a speed comparable to the fastest existing methods and is not limited by the set of predefined styles. Furthermore, our approach allows flexible user control, e.g., content/style trade-offs, style interpolation, color and spatial control, all using a single feed-forward neural network.

1. Introduction

In this work, we propose the first neural style transfer algorithm that resolves this fundamental flexibility/speed dilemma. Our method can transfer arbitrary new styles in real-time, combining the flexibility of the optimization-based framework [16] with a speed similar to the fastest feed-forward methods. Our approach is inspired by Instance Normalization (IN) layers, which are surprisingly effective in feed-forward style transfer. To explain this success, we propose a new interpretation: instance normalization performs style normalization by normalizing feature statistics, which have been found to carry the style information of an image. Motivated by this interpretation, we introduce a simple extension of IN, Adaptive Instance Normalization (AdaIN). Given a content input and a style input, AdaIN simply adjusts the mean and variance of the content input to match those of the style input. Through experiments, we find that AdaIN effectively combines the content of the former and the style of the latter by transferring feature statistics. A decoder network is then learned to generate the final image by inverting the AdaIN output back to image space. Our method is nearly three orders of magnitude faster than [16], without sacrificing the flexibility of transferring inputs to arbitrary new styles. Furthermore, our method provides rich user control at runtime, without any modification to the training process.

2. Related work

Style transfer. The style transfer problem originates from non-photorealistic rendering and is closely related to texture synthesis and transfer. Some early methods include histogram matching of linear filter responses and non-parametric sampling. These methods typically rely on low-level statistics and often fail to capture semantic structure. Gatys et al. [16] were the first to demonstrate impressive style transfer results by matching feature statistics in the convolutional layers of a DNN. Recently, several improvements over [16] have been proposed.

  • Li and Wand introduced a Markov random field (MRF) based framework to enforce local patterns in deep feature spaces.
  • Gatys et al. proposed ways to control color preservation, spatial location, and the scale of style transfer.
  • Ruder et al. improved the quality of video style transfer by imposing temporal constraints.

The framework of Gatys et al. [16] is based on a slow optimization process that iteratively updates the image to minimize a content loss and a style loss computed by a loss network. Even with modern GPUs, it can take several minutes to converge, which makes on-device processing in mobile applications too slow to be practical.

  • A common solution is to replace the optimization process with a feed-forward neural network trained to minimize the same objective. These feed-forward transfer methods are approximately three orders of magnitude faster than the optimization-based alternative, opening the door to real-time applications.
  • The granularity of feed-forward transfer is enhanced by a multi-resolution architecture in Wang et al.
  • Ulyanov et al. propose methods to improve the quality and diversity of generated samples.
  • However, the aforementioned feed-forward methods are bound to a fixed style in each network.
  • To address this problem, Dumoulin et al. introduced a network capable of encoding 32 styles and their interpolation.
  • Concurrent with our work, Li et al. proposed a feed-forward architecture that can synthesize up to 300 textures and transfer 16 styles.
  • Still, these two methods cannot adapt to arbitrary styles that were not observed during training.

Recently, Chen and Schmidt introduced a feed-forward approach that transfers arbitrary styles through a style exchange layer. Given the feature activations of the content and style images, the style exchange layer replaces the content features with the best matching style features in a patch-by-patch manner. However, their style exchange layer creates a new computational bottleneck: more than 95% of the computation is spent on style exchange for 512 × 512 input images. Our method also allows arbitrary style transfer while being 1-2 orders of magnitude faster than Chen and Schmidt.

Another central issue in style transfer is which style loss function to use. The original framework of Gatys et al. [16] matches styles by matching the second-order statistics between feature activations, captured by the Gram matrix. Other effective loss functions have been proposed, such as the MRF loss, adversarial loss, histogram loss, CORAL loss, MMD loss, and the distance between channel-wise means and variances. Note that all of the above loss functions aim to match some feature statistics between the style image and the synthesized image.

Deep generative image modeling . There are several alternative frameworks for image generation, including variational autoencoders, autoregressive models, and generative adversarial networks (GANs). Notably, GANs have achieved the most impressive visual quality. Various improvements to the GAN framework have been proposed, such as conditional generation, multi-stage processing, and better training objectives. GANs have also been applied to style transfer and cross-domain image generation.

3. Background

3.1 Batch Normalization (BN)

The seminal work of Ioffe and Szegedy introduced the batch normalization (BN) layer, which significantly eases the training of feed-forward networks by normalizing feature statistics. BN layers were originally designed to accelerate the training of discriminative networks, but have also been found effective in generative image modeling. Given an input batch x ∈ R^(N×C×H×W), BN normalizes the mean and standard deviation of each individual feature channel:

BN(x) = γ ( (x − μ(x)) / σ(x) ) + β

where γ, β ∈ R^C are affine parameters learned from the data, and μ(x), σ(x) ∈ R^C are the mean and standard deviation, computed across the batch and spatial dimensions independently for each feature channel:

μ_c(x) = (1 / (N H W)) Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw}

σ_c(x) = sqrt( (1 / (N H W)) Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} (x_{nchw} − μ_c(x))² + ε )
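
To make the formula concrete, here is a minimal PyTorch sketch (my own illustrative code, not from the paper) of how the batch statistics and the BN transform above can be computed:

    import torch

    def batch_stats(x, eps=1e-5):
        # x: (N, C, H, W); one (mu, sigma) pair per channel, shared across
        # the batch and spatial dimensions, as in BN
        mu = x.mean(dim=(0, 2, 3), keepdim=True)                      # (1, C, 1, 1)
        sigma = torch.sqrt(x.var(dim=(0, 2, 3), keepdim=True) + eps)
        return mu, sigma

    def batch_norm(x, gamma, beta, eps=1e-5):
        # gamma, beta: learnable affine parameters of shape (C,)
        mu, sigma = batch_stats(x, eps)
        x_hat = (x - mu) / sigma
        return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)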

BN uses mini-batch statistics at training time and replaces them with population statistics at inference time, thus introducing a difference between training and inference.

  • Batch renormalization was recently proposed to address this problem by gradually using the statistics of the population during training.
  • As another interesting application of BN, Li et al. found that BN can mitigate domain shift by recomputing population statistics in the target domain.
  • Recently, several alternative normalization schemes have been proposed to extend the effectiveness of BNs to recurrent architectures.

3.2 Instance Normalization (IN) 

In the original feed-forward stylization method, the style transfer network contains a BN layer after each convolutional layer. Surprisingly, Ulyanov et al. found that significant improvements could be achieved simply by replacing BN layers with IN layers:

IN(x) = γ ( (x − μ(x)) / σ(x) ) + β

Unlike BN layers, here μ(x) and σ(x) are computed across spatial dimensions independently for each channel and each sample:

μ_nc(x) = (1 / (H W)) Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw}

σ_nc(x) = sqrt( (1 / (H W)) Σ_{h=1}^{H} Σ_{w=1}^{W} (x_{nchw} − μ_nc(x))² + ε )

Another difference is that IN layers are applied unchanged at test time, whereas BN layers usually replace mini-batch statistics with population statistics.
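
For comparison with the BN sketch above, a corresponding sketch of the IN statistics (again illustrative, assuming the same tensor layout):

    def instance_stats(x, eps=1e-5):
        # x: (N, C, H, W); unlike BN, statistics are computed per sample and
        # per channel, over the spatial dimensions only
        mu = x.mean(dim=(2, 3), keepdim=True)                         # (N, C, 1, 1)
        sigma = torch.sqrt(x.var(dim=(2, 3), keepdim=True) + eps)
        return mu, sigma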

3.3 Conditional Instance Normalization (CIN)

Instead of learning a single set of affine parameters γ and β, Dumoulin et al. propose a conditional instance normalization (CIN) layer that learns a different set of parameters γ^s and β^s for each style s:

CIN(x; s) = γ^s ( (x − μ(x)) / σ(x) ) + β^s

During training, a style image and its index s are randomly chosen from a fixed set of styles s ∈ {1, 2, ..., S} (S = 32 in their experiments). The content image is then processed by the style transfer network, in which the corresponding γ^s and β^s are used in the CIN layers. Surprisingly, the network can generate images in completely different styles by using the same convolutional parameters but different affine parameters in the IN layers.

Compared to networks without normalization layers, networks with CIN layers require 2FS extra parameters, where F is the total number of feature maps in the network. Since the number of additional parameters scales linearly with the number of styles, it is challenging to scale their method to model a large number of styles (e.g., tens of thousands). Furthermore, their method cannot adapt to arbitrary new styles without retraining the network. 
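
One way to realize a CIN layer is to store one (γ, β) pair per style in an embedding table and select it by style index. This is a hedged sketch under that assumption (it reuses the instance_stats helper above; the class and argument names are mine, not from the original implementation):

    import torch.nn as nn

    class ConditionalInstanceNorm(nn.Module):
        def __init__(self, num_channels, num_styles):
            super().__init__()
            # one affine pair per style: 2 * F * S extra parameters in total
            self.gamma = nn.Embedding(num_styles, num_channels)
            self.beta = nn.Embedding(num_styles, num_channels)
            nn.init.ones_(self.gamma.weight)
            nn.init.zeros_(self.beta.weight)

        def forward(self, x, style_index):
            # x: (N, C, H, W); style_index: (N,) long tensor with values in [0, S)
            mu, sigma = instance_stats(x)
            x_hat = (x - mu) / sigma
            g = self.gamma(style_index).view(-1, x.size(1), 1, 1)
            b = self.beta(style_index).view(-1, x.size(1), 1, 1)
            return g * x_hat + b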

4. Interpreting instance normalization

Despite the great success of (conditional) instance normalization, the reasons why it works particularly well for style transfer remain elusive. Ulyanov et al. attribute the success of IN to its invariance to the contrast of the content image. However, IN takes place in feature space, so it should have more profound effects than a simple contrast normalization in pixel space. Perhaps even more surprisingly, the affine parameters in IN can completely change the style of the output image.

It is well known that the convolutional feature statistics of DNNs can capture the style of images. While Gatys et al. [16] used second-order statistics as their optimization objective, Li et al. recently showed that matching many other statistics, including channel means and variances, is also effective for style transfer. Inspired by these observations, we consider instance normalization to perform a form of style normalization by normalizing feature statistics (i.e. mean and variance). Although DNNs are used as image descriptors in [16], we argue that the feature statistics of the generator network can also control the style of generated images.

We run the code of an improved texture network to perform single-style transfer, with IN or BN layers. As expected, the model with IN converges faster than the BN model (Fig. 1(a)). To test the explanation of Ulyanov et al., we then normalize all training images to the same contrast by performing histogram equalization on the luminance channel. As shown in Fig. 1(b), IN still works, indicating that the contrast explanation is incomplete. To test our hypothesis, we normalize all training images to the same style (different from the target style) using a pretrained style transfer network. According to Fig. 1(c), the improvement brought by IN becomes much smaller when the images have already been style-normalized. The remaining gap can be explained by the imperfection of the style normalization. Moreover, models trained with BN on style-normalized images converge as fast as models trained with IN on the original images. Our results indicate that IN does perform a kind of style normalization.

Since BN normalizes the feature statistics of a batch of samples rather than a single sample, it can be intuitively understood as normalizing a batch of samples to be centered around a single style. However, each sample may still have a different style. This is undesirable when we want to transfer all images to the same style, as in the original feed-forward style transfer algorithm. Although convolutional layers might learn to compensate for intra-batch style differences, this poses additional challenges for training. On the other hand, IN can normalize the style of each individual sample to the target style. Training is facilitated because the rest of the network can focus on content manipulation while discarding the original style information. The reason behind the success of CIN also becomes clear: different affine parameters can normalize the feature statistics to different values and thereby normalize the output image to different styles.

5. Adaptive Instance Normalization (AdaIN)

If IN normalizes the input to a single style specified by the affine parameters, can it be adapted to any given style by using adaptive affine transformations? Here, we propose a simple extension of IN, which we call Adaptive Instance Normalization (AdaIN). AdaIN receives a content input x and a style input y, and simply aligns the channel-wise mean and variance of x to match those of y. Unlike BN, IN or CIN, AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:

AdaIN(x, y) = σ(y) ( (x − μ(x)) / σ(x) ) + μ(y)

in which we simply scale the normalized content input with σ(y) and shift it with μ(y). Similar to IN, these statistics are computed across spatial locations.
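
In code, AdaIN is only a few lines on top of the instance statistics sketched earlier (an illustrative implementation of the equation above, not the authors' released code):

    def adain(content_feat, style_feat, eps=1e-5):
        # align the channel-wise mean and std of the content features
        # with those of the style features
        c_mu, c_sigma = instance_stats(content_feat, eps)
        s_mu, s_sigma = instance_stats(style_feat, eps)
        return s_sigma * (content_feat - c_mu) / c_sigma + s_mu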

Intuitively, let us consider a feature channel that detects style-specific strokes. Styled images with such strokes will produce high average activations for this feature. The output of AdaIN will have the same high average activation for this feature while preserving the spatial structure of the content image. The stroke features can be inverted to image space using a feed-forward decoder. The variance of this feature channel can encode more subtle style information, which is also passed to the AdaIN output and the final output image.

In short, AdaIN performs style transfer in feature space by transferring feature statistics, especially channel means and variances. Our AdaIN layer plays a similar role to the style exchange layer proposed in [6]. Although the style exchange operation is very time- and memory-intensive, our AdaIN layer is as simple as the IN layer with little added computational cost.

6. Experimental setup

6.1 Structure

Our style transfer network T takes a content image c and an arbitrary style image s as inputs, and synthesizes an output image that recombines the content of the former and the style of the latter. We adopt a simple encoder-decoder architecture, in which the encoder f is fixed to the first few layers (up to relu4_1) of a pretrained VGG-19. After encoding the content and style images in feature space, we feed both feature maps to an AdaIN layer that aligns the mean and variance of the content feature maps to those of the style feature maps, producing the target feature maps t:

t = AdaIN(f(c), f(s))

A randomly initialized decoder g is trained to map t back to the image space, generating the stylized image T(c, s):

T(c, s) = g(t)

The decoder mostly mirrors the encoder, with all pooling layers replaced by nearest up-sampling to reduce checkerboard effects. We use reflection padding in both f and g to avoid border artifacts. Another important architectural choice is whether the decoder should use IN, BN, or no normalization. As discussed in Section 4, IN normalizes each sample to a single style, while BN normalizes a batch of samples to be centered around a single style. Both are undesirable when we want the decoder to generate images in vastly different styles. Thus, we do not use normalization layers in the decoder. In Section 7.2 we show that IN/BN layers in the decoder indeed hurt performance.
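
A rough sketch of the resulting forward pass, assuming torchvision's pretrained VGG-19 and the adain helper above (the decoder itself is omitted; in torchvision's layer indexing, relu4_1 is module 20 of vgg19().features):

    import torch.nn as nn
    import torchvision

    # encoder f: first layers of a pretrained VGG-19, frozen, up to relu4_1
    # (newer torchvision versions use weights=... instead of pretrained=True)
    vgg_features = torchvision.models.vgg19(pretrained=True).features
    encoder = nn.Sequential(*list(vgg_features.children())[:21])
    for p in encoder.parameters():
        p.requires_grad = False

    def stylize(content, style, decoder):
        t = adain(encoder(content), encoder(style))   # t = AdaIN(f(c), f(s))
        return decoder(t)                             # T(c, s) = g(t)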

6.2 Training

Following the setup of [6], we train our network using MS-COCO as content images and a dataset of paintings mostly collected from WikiArt as style images. Each dataset contains roughly 80,000 training examples. We use the Adam optimizer with a batch size of 8 content-style image pairs. During training, we first resize the smaller dimension of both images to 512 while preserving the aspect ratio, and then randomly crop regions of size 256 × 256. Since our network is fully convolutional, it can be applied to images of any size during testing.
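
The preprocessing described above could look roughly like this with torchvision transforms (a sketch of the described pipeline, not the authors' training script):

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize(512),       # resize the smaller side to 512, keeping aspect ratio
        transforms.RandomCrop(256),   # random 256 x 256 crop
        transforms.ToTensor(),
    ])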

We use the pretrained VGG-19 to compute the loss function for training the decoder:

L = L_c + λ L_s

which is a weighted combination of a content loss L_c and a style loss L_s, with style loss weight λ. The content loss is the Euclidean distance between the target features and the features of the output image. We use the AdaIN output t as the content target, instead of the commonly used feature responses of the content image. We find this leads to slightly faster convergence and also aligns with our goal of inverting the AdaIN output t:

L_c = || f(g(t)) − t ||_2

Since our AdaIN layer only transfers the mean and standard deviation of the style features, our style loss only matches these statistics. Although we find that the commonly used Gram matrix loss produces similar results, we match the IN statistics because they are conceptually cleaner. This style loss has also been explored by Li et al.:

L_s = Σ_{i=1}^{L} || μ(φ_i(g(t))) − μ(φ_i(s)) ||_2 + Σ_{i=1}^{L} || σ(φ_i(g(t))) − σ(φ_i(s)) ||_2

where each φ_i denotes a layer in VGG-19 used to compute the style loss. In our experiments we use the relu1_1, relu2_1, relu3_1 and relu4_1 layers with equal weights.
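
A minimal sketch of these two losses, reusing the instance_stats helper from Section 3 (F.mse_loss is used here as a stand-in for the squared Euclidean distance; names are illustrative):

    import torch.nn.functional as F

    def content_loss(f_of_gt, t):
        # distance between f(g(t)) and the AdaIN target t
        return F.mse_loss(f_of_gt, t)

    def style_loss(phi_gt_list, phi_s_list):
        # phi_gt_list / phi_s_list: features of g(t) and of the style image
        # at relu1_1, relu2_1, relu3_1, relu4_1
        loss = 0.0
        for phi_gt, phi_s in zip(phi_gt_list, phi_s_list):
            mu_gt, sigma_gt = instance_stats(phi_gt)
            mu_s, sigma_s = instance_stats(phi_s)
            loss = loss + F.mse_loss(mu_gt, mu_s) + F.mse_loss(sigma_gt, sigma_s)
        return loss

    # total objective: L = content_loss(...) + lambda * style_loss(...)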

7. Results

In this section, we compare our method with three types of style transfer methods:

  • the flexible but slow optimization-based method of Gatys et al. [16],
  • the fast feed-forward method of Ulyanov et al. [52], which is restricted to a single style,
  • the flexible, medium-speed patch-based method of Chen and Schmidt [6].

If not stated otherwise, the results of the compared methods were obtained by running their code with the default configuration. For [6], we use the pretrained inverse network provided by the authors. All test images are of size 512×512.

7.1 Comparison with other methods

Qualitative results. In Fig. 4, we show example style transfer results generated by the compared methods.

  • Note that none of the test style images were observed during the training of our model, while the results of Ulyanov et al. are obtained by fitting one network to each test style.
  • Even so, for many images (e.g., rows 1, 2, 3), the quality of our stylized images is quite competitive with those of Ulyanov et al. and Gatys et al.
  • In some other cases (e.g., row 5), our method falls slightly behind the quality of Ulyanov et al. and Gatys et al. This is not surprising, as we believe there is a three-way trade-off between speed, flexibility, and quality.
  • Compared with Chen and Schmidt, our method appears to transfer styles more faithfully for most of the compared images.
  • The last example clearly illustrates a major limitation of Chen and Schmidt, which tries to match each content patch with the closest-matching style patch. Style transfer fails if most content patches are matched to a few style patches that are not representative of the target style.
  • We therefore argue that matching global feature statistics is a more general solution, although in some cases (e.g., row 3) the method of Chen and Schmidt can also produce appealing results.

Quantitative evaluation. Does our algorithm trade off some quality for higher speed and flexibility, and if so, by how much? To answer this quantitatively, we compare our method with the optimization-based method of Gatys et al. and the fast single-style transfer method of Ulyanov et al. in terms of content and style loss. Because our method uses a style loss based on IN statistics, we also modify the loss functions of those two methods accordingly for a fair comparison (their results in Fig. 4 are still obtained with the default Gram matrix loss). The content loss shown here is the same as in both methods. The reported numbers are averaged over 10 style images and 50 content images randomly chosen from the test sets of the WikiArt dataset and MS-COCO, respectively.

As shown in Figure 3, the average content and style loss of our synthesized images is slightly higher than, but comparable to, that of the single-style transfer method of Ulyanov et al. In particular, our method and that of Ulyanov et al. obtain a style loss similar to that of Gatys et al. after 50 to 100 iterations of optimization. This demonstrates the strong generalization ability of our method, considering that our network never sees the test styles during training, whereas each network of Ulyanov et al. is trained specifically on a test style. Also note that our style loss is much smaller than that of the original content image.

Speed analysis. Most of our computation is spent on content encoding, style encoding, and decoding, each taking roughly one third of the time. In some application scenarios such as video processing, the style image only needs to be encoded once, and AdaIN can use the stored style statistics to process all subsequent images. In other cases (for example, transferring the same content to different styles), the computation spent on content encoding can be shared.
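
For the video scenario mentioned above, the caching idea can be sketched as follows (reusing the encoder and instance_stats helpers from earlier sections; the function is illustrative):

    def stylize_video(frames, style, decoder):
        # encode the style image once and cache its channel-wise statistics
        s_mu, s_sigma = instance_stats(encoder(style))
        outputs = []
        for frame in frames:
            f_c = encoder(frame)
            c_mu, c_sigma = instance_stats(f_c)
            t = s_sigma * (f_c - c_mu) / c_sigma + s_mu   # AdaIN with cached style stats
            outputs.append(decoder(t))
        return outputs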

In Table 1, we compare the speed of our method with previous methods. Excluding style encoding time, our algorithm runs at 56 and 15 FPS for 256 × 256 and 512 × 512 images, respectively, allowing arbitrary user-uploaded styles to be processed in real-time. Among algorithms applicable to arbitrary styles, our method is nearly 3 orders of magnitude faster than (Gatys) and 1–2 orders of magnitude faster than (Chen and Schmidt). The speed improvement over (Chen and Schmidt) is especially important for higher resolution images, since the style exchange layer in (Chen and Schmidt) does not scale well to high resolution style images. Furthermore, our method achieves comparable speed to feed-forward methods limited to a few styles (Ulyanov, Dumoulin). The slightly longer processing time of our method is mainly due to our larger VGG-based network rather than methodological limitations. With a more efficient architecture, our speed can be further improved.

7.2 Additional experiments

In this subsection, we conduct experiments to justify our important architectural choices. We denote the method described in Section 6 as Enc-AdaIN-Dec. We experimented with a model called Enc-Concat-Dec, replacing AdaIN with concatenation, a natural baseline strategy for combining information from content and style images. Furthermore, we run the model with BN/IN layers in the decoder, denoted Enc-AdaIN-BNDec and Enc-AdaIN-INDec respectively. Other training settings remain unchanged.

In Figures 5 and 6 we show examples and training curves of the different methods. In the images generated by the Enc-Concat-Dec baseline (Fig. 5(d)), the object contours of the style image can be clearly observed, indicating that the network fails to separate the style information from the content of the style image. This is also consistent with Fig. 6, where Enc-Concat-Dec achieves low style loss but fails to reduce the content loss. Models with BN/IN layers also obtain qualitatively worse results and consistently higher losses; the results with IN layers are particularly poor. This once again verifies our claim that IN layers tend to normalize the output to a single style and should therefore be avoided when we want to generate images in different styles.

7.3 Real-time control

To further highlight the flexibility of our method, we show that our style transfer network allows users to control the degree of stylization, interpolate between different styles, transfer styles while preserving color, and use different styles in different spatial regions. Note that all of these controls are applied at runtime using the same network, without any modification to the training process.

Content-style trade-off. The degree of style transfer can be controlled during training by adjusting the style weight λ in Equation 11. In addition, our method allows a content-style trade-off at test time by interpolating between the feature maps that are fed to the decoder. Note that this is equivalent to interpolating between the affine parameters of AdaIN:

T(c, s, α) = g( (1 − α) f(c) + α AdaIN(f(c), f(s)) )

The network tries to faithfully reconstruct the content image when α = 0, and synthesizes the most heavily stylized image when α = 1.

As shown in Figure 7, a smooth transition between content similarity and style similarity can be observed by varying α (from 0 to 1). 
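
The trade-off can be sketched by blending the content features with the AdaIN output before decoding (reusing the encoder and adain helpers from earlier; alpha corresponds to α above):

    def stylize_with_tradeoff(content, style, decoder, alpha=1.0):
        f_c = encoder(content)
        t = adain(f_c, encoder(style))
        t = alpha * t + (1 - alpha) * f_c   # alpha = 0: content, alpha = 1: fully stylized
        return decoder(t)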

Style interpolation. To interpolate between a set of K style images s_1, s_2, ..., s_K with corresponding weights w_1, w_2, ..., w_K such that Σ_{k=1}^{K} w_k = 1, we similarly interpolate between the feature maps (results are shown in Figure 8):

T(c, s_{1...K}, w_{1...K}) = g( Σ_{k=1}^{K} w_k AdaIN(f(c), f(s_k)) )
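
A corresponding sketch of style interpolation (again reusing the encoder and adain helpers; the weights are assumed to sum to 1, as above):

    def interpolate_styles(content, styles, weights, decoder):
        # styles: list of K style images, weights: list of K scalars with sum 1
        f_c = encoder(content)
        t = sum(w * adain(f_c, encoder(s)) for w, s in zip(weights, styles))
        return decoder(t)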

Spatial and color control. Gatys et al. recently introduced user controls over color information and the spatial location of style transfer, which can be easily incorporated into our framework. To preserve the colors of the content image, we first match the color distribution of the style image to that of the content image, and then perform normal style transfer using the color-aligned style image as the style input. Example results are shown in Figure 9.

In Figure 10, we demonstrate that our method can transfer different regions of the content image to different styles. This is achieved by performing AdaIN separately on different regions of the content feature map, using statistics from different style inputs, similar to the approach of Gatys et al. but in a completely feed-forward manner. Although our decoder is only trained on inputs with a homogeneous style, it generalizes naturally to inputs in which different regions have different styles.
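
A simplified sketch of the spatial control idea for two styles and a binary mask at feature resolution (this version blends two AdaIN outputs with the mask; the paper applies AdaIN per region, so this is only an approximation of that procedure):

    def spatial_style_transfer(content, style_a, style_b, mask, decoder):
        # mask: (1, 1, H', W') with 1 where style_a applies and 0 where style_b applies
        f_c = encoder(content)
        t_a = adain(f_c, encoder(style_a))
        t_b = adain(f_c, encoder(style_b))
        return decoder(mask * t_a + (1 - mask) * t_b)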

8. Discussion and conclusion 

In this paper, we propose a simple Adaptive Instance Normalization (AdaIN) layer, which for the first time enables arbitrary style transfer in real-time. Beyond the fascinating applications, we believe this work sheds light on our understanding of deep image representations in general.

It is interesting to consider the conceptual differences between our method and previous neural style transfer methods based on feature statistics. Gatys et al. employ an optimization process to manipulate pixel values so as to match feature statistics. In some works, the optimization process is replaced by feed-forward neural networks; still, the network is trained to modify pixel values to indirectly match feature statistics. We take a very different approach that directly aligns the statistics in feature space in one shot, and then inverts the features back to pixel space.

Given the simplicity of our method, we believe there is still much room for improvement. In future work, we plan to explore more advanced network architectures, such as residual architectures or architectures with additional skip connections from the encoder. We also plan to investigate more complex training schemes such as incremental training. Furthermore, our AdaIN layer aligns only the most basic feature statistics (mean and variance). Replacing AdaIN with correlation alignment or histogram matching may further improve quality by transferring higher-order statistics. Another interesting direction is to apply AdaIN to texture synthesis. 

Appendix

4. The effect of using AdaIN in different layers

Figure 2 shows the effect of implementing AdaIN with different layers. Using relu4_1 achieves better perceptual results than earlier layers. 

 

References

[16] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[52] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, 2017.

[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.

X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017, pp. 1501-1510.

S. Summary

S.1 Main idea

To explain the success of instance normalization, the authors propose a new interpretation: instance normalization performs style normalization by normalizing feature statistics, which carry the style information of an image. Based on this, the authors propose Adaptive Instance Normalization (AdaIN). Given a content input and a style input, AdaIN only needs to adjust the mean and variance of the content input to match those of the style input, so that the generated image has the content of the former and the style of the latter.

S.2 AdaIN

AdaIN is shown in Equation 8:

AdaIN(x, y) = σ(y) ( (x − μ(x)) / σ(x) ) + μ(y)

where x and y denote the content input and the style input, respectively. μ(x) and σ(x) denote the mean and standard deviation of the content features, and μ(y) and σ(y) denote those of the style features. Since the feature statistics of an image carry its style information, style transfer is realized by first removing the style information of the content input through normalization, and then applying an affine transformation with the feature statistics (style information) of the style input.

S.3 Structure and the effect of using AdaIN in different layers 

The network structure used in this paper and the effect of using AdaIN in different layers are shown in the above two figures.

Since AdaIN operates on the statistics of image features (i.e., in feature space), the later layers of the network extract more accurate features. Based on the statistics of these more precise features, the style of the content image can be removed more thoroughly during instance normalization, which leads to higher-quality style transfer.
