StyleGAN-NADA: CLIP-Guided Non-Adversarial Domain Adaptation of Image Generators

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Official account: EDPJ

Table of contents

0. Summary

0.1 Explanation of keywords and terms

1 Introduction

2. Related work

3. Preliminaries

3.1 StyleGAN

3.2 StyleCLIP

4. Method

4.1 CLIP-based guidance

4.2 Layer freezing

4.3 Latent mapper mining

5. Experiment

5.1 Results

5.2 Latent Space Exploration 

5.3 Comparison with other methods

5.4 Ablation study

6 Conclusion

References

Appendix

A. Broader Impact

B. Analysis of the CLIP space

C. Few-shot CLIP guidance

D. Beyond StyleGAN

E. Identity Preservation

F. Cross-Model interpolation 

G. More samples (Additional samples)

H. Qualitative Few-shot Comparison

I. Training Details

J. Licenses and data privacy 


0. Summary

Is it possible to train a generative model to produce images from a specific domain, guided only by text and without looking at any images? In other words, can a generator be trained "blindly"? Using the semantic capabilities of large-scale Contrastive Language-Image Pre-training (CLIP) models, a generative model can be shifted to a new domain based on text alone, without any reference images. With only text guidance and a short training time, the generator can produce images with a wide variety of styles and shapes; notably, many of these changes are very difficult for existing methods. Extensive experiments across many domains demonstrate the effectiveness of the method and show that the latent-space structure of the model is preserved, making the adapted generators well suited for downstream tasks.

0.1 Explanation of keywords and terms

generative model

Cross-domain, Domain Adaptation, Domain Shift

Non-Adversarial Domain Adaptation

Contrastive-Language-Image-Pretraining (CLIP) models

  • Based on the goal of contrastive learning, both images and text are mapped to a joint multimodal embedding space (latent space, latent space)
  • The representations of images and text in this space are called latent codes
  • Images can be generated or modified by optimizing these latent codes
  • Image attributes such as age and gender are approximately linearly separable in the latent space, i.e., a hyperplane separates the two sides of each attribute. By moving a latent code along the normal direction of an attribute's hyperplane, that attribute can be edited, and multiple attributes can be edited without interfering with each other.

Adaptive layer selection, or adaptive trainable weight selection

  • Principle: During model training, different layers in the deep network extract different features; conversely, when images are generated, different layers generate different features.
  • The essence of domain transfer: based on the difference between the source and target domains, retain the key information of the source-domain image and change the rest, so that the generated image satisfies the target domain while preserving that key information.
  • Freeze the layers most relevant to generating the key information and train the remaining layers; see Section 4.2 (layer freezing) for details.

Segmentation mask (segmentation mask, Appendix D )

  • Similar to layer selection, except that the mask operates directly on the image
  • Keep the image region of the class to be changed visible, mask out the remaining regions, and then feed the result to CLIP so that only the visible region drives the conversion

1 Introduction

The unprecedented ability of Generative Adversarial Networks (GANs) to capture the distribution of the modeled images through large semantic latent spaces has revolutionized countless fields: image enhancement, editing, and, more recently, even discriminative tasks such as classification and regression.

Typically, these models are limited to domains where a large number of images can be collected. This severely limits the usefulness of the model. In fact, in many scenarios (a certain artist's painting, a rare medical condition, an imagined scene), there may not be enough or even no data to train a GAN.

More recently, vision-language models such as CLIP have been shown to encapsulate generic visual knowledge, reducing the need to collect data. Furthermore, such models can be combined with generative models to provide a simple, intuitive text-driven interface for image generation and manipulation. However, these works build on generative models pre-trained on a fixed domain, and users can only generate and manipulate images within that domain.

The method in this paper can generate images outside the original domain. The figure above shows three examples of out-of-domain generation; all three models were trained blindly, without seeing any images from the target domain.

Text-guided training using CLIP is challenging. Naive approaches (requiring the generated images to maximize a CLIP-based classification score) often lead to adversarial results (see Appendix B). Instead, the authors encode the domain difference as a textual direction in CLIP's embedding space, and propose a new loss together with a two-generator training setup: one generator remains frozen and provides samples from the source domain, while the other is optimized to generate images that differ from the source domain only along the cross-domain direction described by the text in CLIP space.

For very drastic domain changes, in order to increase the stability of training, an adaptive training method is introduced: in each training iteration, CLIP is used to identify the most relevant network layers, and then only train these most relevant network layers.

The authors apply this method to StyleGAN2 and demonstrate its effectiveness on a wide range of source and target domains, including artistic styles, cross-species transfer, and significant shape changes (e.g., turning dogs into bears). Compared with existing editing techniques and few-shot methods, StyleGAN-NADA does the same work without requiring any training data.

Finally, the authors demonstrate that StyleGAN-NADA preserves the appealing structure of the latent space. The transferred generator not only retains the original editing capabilities, it can even reuse existing editing directions and models trained for the original domain.

2. Related work

Text-guided synthesis . Visual language tasks include language-based image retrieval (retrieval), image captioning (captioning), visual question answering, and text-guided synthesis. To solve these tasks, cross-modal visual and language representations need to be learned, usually by training a transformer.

The recently introduced CLIP is a powerful model for joint vision-language representation. It is trained on 400 million text-image pairs with a contrastive objective that maps both images and text into a joint multimodal embedding space. Representations learned by CLIP have been used in many tasks, including image synthesis and manipulation. These methods optimize latent codes to generate or manipulate a specific image; in contrast, this paper takes a new approach and uses only a piece of text to guide the training of the generator itself.

Training generators with limited data . The goal of few-shot generative modeling is to capture a rich and diverse distribution from only a small number of target-domain samples. Methods for this task fall into two broad categories: training from scratch, and fine-tuning a pre-trained generator (leveraging its diversity).

  • For the from-scratch approaches, "few" usually means thousands of images (rather than tens of thousands or millions). These methods typically apply data augmentation, or use auxiliary tasks so that the discriminator learns better from the available data.
  • In the transfer-learning setting, "few" usually ranges from 5 to a few hundred images. When training with very little data, the primary concern is to avoid mode collapse and overfitting while successfully transferring the diverse source generator to the target domain. There are many approaches to this challenge: some constrain which weights can change; others introduce new parameters that control channel-level statistics, sample from appropriate regions of the latent space, add regularization terms, or enforce cross-domain alignment.

Where previous methods use a small amount of data for generator adaptation, our method does the same without any training data. Furthermore, previous methods fine-tune a fixed set of trainable weights, whereas our method adaptively selects which layers to train at each step, based on the state of the network and the nature of the change being trained.

3. Preliminaries

The core of our method consists of two components: StyleGAN2 and CLIP. Next, we discuss the relevant features of the StyleGAN architecture and how CLIP has been used in prior work.

3.1 StyleGAN

In recent years, StyleGAN and its variants have been the state-of-the-art unconditional image generators. The generator consists of two main parts. A mapping network transforms a latent code z, sampled from a Gaussian prior, into a vector w in a learned latent space W. These latent vectors are then fed to the synthesis network, where they control the statistics of the features (or kernels). By traversing W, or by mixing different w codes at different network layers, fine-grained control over the semantic properties of the generated image can be achieved.
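As a rough illustration of this pipeline, the sketch below samples a latent code, maps it to W, broadcasts it to the per-layer W+ representation, and optionally mixes in a second code at the later layers. `mapping` and `synthesis` are hypothetical stand-ins for the two halves of a StyleGAN2 generator, not an actual library API.

```python
# A schematic sketch of the StyleGAN sampling pipeline described above.
# `mapping` and `synthesis` are hypothetical stand-ins for the two halves of a
# StyleGAN2 generator; real implementations expose different interfaces.
import torch

def sample_image(mapping, synthesis, n_layers=18, mix_prob=0.5):
    z1 = torch.randn(1, 512)                         # latent from the Gaussian prior
    w1 = mapping(z1)                                 # z -> w in the learned space W
    w_plus = w1.unsqueeze(1).repeat(1, n_layers, 1)  # same w for every layer (W+)

    # Style mixing: feed a second code to the later layers so that coarse and
    # fine semantic attributes can be controlled independently.
    if torch.rand(()).item() < mix_prob:
        w2 = mapping(torch.randn(1, 512))
        crossover = int(torch.randint(1, n_layers, (1,)).item())
        w_plus[:, crossover:, :] = w2

    return synthesis(w_plus)                         # per-layer styles -> image
```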

However, these latent space editing operations are usually only in the domain of the initial training set.

3.2 StyleCLIP

In recent work, Patashnik et al. combined the generative power of StyleGAN with the semantic knowledge of CLIP to discover new editing directions, using only a textual description of the desired change. They propose three ways to use the semantic capabilities of CLIP.

The first two minimize the distance in CLIP space between the generated image and the target text, either by optimizing the latent code directly or by training an encoder (mapper) that modifies the input latent code.

The third approach, the one used by the authors, uses CLIP to determine global editing directions in the latent space: latent-space entries are modified individually to identify which of them cause image-space changes consistent with the direction between two textual descriptions (source and target) in CLIP space.

However, these methods share a restricted latent space editing approach: the changes that can be made are limited to the domain of the pretrained generator. Thus, they can change hairstyles, expressions, and even turn wolves into lions (if the generator has seen both), but cannot turn photos into paintings, or generate cats with a generator trained on dogs.

4. Method

Domain transfers are done with text guidance only (no images). Only pre-trained CLIP models are used for supervision.

This paper addresses the task through two questions:

How best to extract the semantic information encapsulated in CLIP?

How can the optimization be regularized to avoid adversarial solutions and mode collapse?

4.1 CLIP-based guidance

Global loss . A straightforward option is the CLIP-guided loss used by StyleCLIP for latent optimization:

$$\mathcal{L}_{global} = D_{CLIP}\big(G(w),\, t_{target}\big) \tag{1}$$

where $G(w)$ is the image generated by feeding the latent code $w$ to the generator $G$, $t_{target}$ is the textual description of the target class, and $D_{CLIP}$ is the cosine distance in CLIP space. The loss is called "global" because it does not depend on the initial image or domain.

In practice, this loss leads to adversarial solutions. Without a fixed generator to keep solutions on the real-image manifold (as in latent-code optimization), the optimizer is free to fool the classifier (CLIP) by adding pixel-level perturbations to the images. Moreover, maintaining diversity does not yield a better loss; in fact, a mode-collapsed generator that produces only a single image with minimal distance to the text may well be the optimum. The analysis of CLIP's embedding space in Appendix B confirms this. These shortcomings make the loss unsuitable for training the generator, but the authors still use it for adaptive layer selection.
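To make the global loss concrete, here is a minimal sketch using the OpenAI `clip` package; it assumes the generated images have already been resized and normalized for CLIP's ViT-B/32 input, and the function and variable names are illustrative rather than taken from the official implementation.

```python
# A minimal sketch of the global CLIP loss (Equation 1) using the OpenAI `clip`
# package. Generated images are assumed to be already resized/normalized for
# CLIP's input; names here are illustrative, not the official code.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def global_clip_loss(generated_images, target_text):
    """1 - cosine similarity between image embeddings and the target-text embedding."""
    image_features = clip_model.encode_image(generated_images)
    text_features = clip_model.encode_text(clip.tokenize([target_text]).to(device))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (1 - image_features @ text_features.T).mean()
```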

Guided (directional) CLIP loss . To address these issues, the authors take inspiration from StyleCLIP's global-direction approach. Ideally, one would identify the CLIP-space direction between the source and target domains, and then fine-tune the generator along that direction so that its images shift away from the source domain accordingly.

To implement this scheme, a cross-domain direction is first determined by embedding a pair of textual descriptions of the source and target domains (e.g., "dog" and "cat") in CLIP space. Then, the direction in CLIP space between the images produced by the generator before and after fine-tuning must be determined. The authors do this with two generators:

  • Start with a generator pre-trained on a single source domain (e.g. faces, dogs, churches, or cars) and then clone it.
  • One replica remains frozen during training. Its role is to provide an image of the source domain for each hidden codeword.
  • The second copy is used for training. Based on the guidance of the textual description, it is fine-tuned to produce images that are different from the source domain.
  • The two generators are denoted $G_{frozen}$ and $G_{train}$, respectively.

In order to maintain the alignment of the latent space, the two generators share the same mapping network that remains unchanged throughout the process. The complete training is shown in the figure above.

The figure above illustrates these directions in CLIP space. The guided (directional) loss is given by:

$$\Delta T = E_T(t_{target}) - E_T(t_{source}), \qquad \Delta I = E_I\big(G_{train}(w)\big) - E_I\big(G_{frozen}(w)\big),$$

$$\mathcal{L}_{direction} = 1 - \frac{\Delta I \cdot \Delta T}{\left\lVert \Delta I \right\rVert \left\lVert \Delta T \right\rVert} \tag{2}$$

where $E_I$ and $E_T$ are the CLIP image and text encoders, and $t_{source}$, $t_{target}$ are the source- and target-class texts.
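A sketch of this guided (directional) loss is shown below, reusing `clip_model` and `device` from the earlier global-loss sketch. `G_frozen` and `G_train` stand for the two generator copies described above; as before, CLIP preprocessing of the generated images is assumed to happen elsewhere.

```python
# A sketch of the guided (directional) CLIP loss (Equation 2), reusing
# `clip_model` and `device` from the global-loss sketch above.
def directional_clip_loss(w, G_frozen, G_train, source_text, target_text):
    # Text direction: target description minus source description.
    tokens = clip.tokenize([source_text, target_text]).to(device)
    text_features = clip_model.encode_text(tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    delta_t = text_features[1] - text_features[0]
    delta_t = delta_t / delta_t.norm()

    # Image direction: trained-generator output minus frozen-generator output.
    delta_i = clip_model.encode_image(G_train(w)) - clip_model.encode_image(G_frozen(w))
    delta_i = delta_i / delta_i.norm(dim=-1, keepdim=True)

    # 1 - cosine similarity between the image direction and the text direction.
    return (1 - delta_i @ delta_t).mean()
```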

Such a loss can overcome the shortcomings of the global loss:

  • First, unlike the global loss, the directional loss discourages mode collapse: if the target generator produced only a single image, the CLIP-space directions from the various source images to that single target image would all differ, so they could not all align with the text direction.
  • Second, it is harder for the network to converge to adversarial solutions, because it would have to craft perturbations that fool CLIP across an essentially unlimited set of different instances.

4.2 Layer freezing

For text-based domain transfers such as converting photos to sketches, the training strategy described above converges quickly, before mode collapse or overfitting set in. However, extensive shape changes require longer training, which can make the network unstable and lead to poor results.

Previous work on few-shot domain adaptation found that training only a subset of network weights can greatly improve the quality of generated images. Since some layers of the source generator are useful for generating certain aspects of the target domain, they are preserved. Additionally, optimizing fewer parameters reduces model complexity and the risk of overfitting.

Based on these methods, the training process is regularized by limiting the number of weights that can be changed at each training iteration.

Ideally, weights that are more relevant to a given change would be considered trainable weights. To determine these weights, latent space editing techniques are reviewed, in particular, StyleCLIP.

In StyleGAN, the codes fed to different network layers affect different semantic features. Thus, by considering an editing direction in W+ space (the extended latent space in which a separate code $w_i \in W$ is provided to each StyleGAN layer), one can determine which layers are most strongly associated with a given change. Based on this, the authors propose a training strategy in which, at each iteration:

  • Select the k most relevant layers;
  • In single-step training, only these layers are optimized, and other layers are frozen

For choosing k layers:

  • Randomly sample $N_w$ codes from W, and convert them to W+ by replicating each code to every layer.
  • Then, run $N_i$ iterations of StyleCLIP latent-code optimization (using the global loss, Equation 1).
  • Select the k layers whose codes changed most significantly.

The two-step process is shown in the figure above:

  • In the first step, a set of latent codes is optimized in W+ space (the green part of the figure) while all network weights are kept fixed. This optimization uses the global CLIP loss (Equation 1). The layers whose corresponding w entries change most are selected.
  • In the second step, the selected layers are unfrozen and then optimized with the guided CLIP loss (Equation 2).

In all cases, StyleGAN's mapping network, the affine code transformations, and all toRGB layers are also kept frozen.
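The layer-selection step can be sketched as follows, reusing `global_clip_loss`, `clip_model`, and `device` from the earlier sketches. The generator interface (a `mapping` callable and a frozen generator that accepts W+ codes directly) and the default hyperparameters are assumptions for illustration.

```python
# A sketch of the adaptive layer-selection step, reusing `global_clip_loss`,
# `clip_model` and `device` from the earlier sketches. It assumes a `mapping`
# callable and a frozen generator that accepts W+ codes directly.
def select_k_layers(G_frozen, mapping, target_text, k,
                    n_latents=8, n_iters=10, lr=0.01, n_layers=18):
    with torch.no_grad():
        w = mapping(torch.randn(n_latents, 512, device=device))
        w_plus = w.unsqueeze(1).repeat(1, n_layers, 1)     # copy each w to every layer

    # Optimize the codes (not the weights) with the global loss, Equation 1.
    w_opt = w_plus.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w_opt], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = global_clip_loss(G_frozen(w_opt), target_text)
        loss.backward()
        optimizer.step()

    # Rank layers by how strongly their codes moved; only the top k are unfrozen.
    change_per_layer = (w_opt.detach() - w_plus).norm(dim=-1).mean(dim=0)
    return change_per_layer.topk(k).indices.tolist()
```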

Note that this procedure is fundamentally different from gradient-based layer selection during direct training. Latent-code optimization with a frozen generator favors solutions that remain on the real-image manifold, so using it for layer selection keeps training aligned with realistic changes. Selecting layers during direct training, in contrast, makes the model more prone to unrealistic or adversarial solutions.

4.3 Latent mapper mining

For some shape changes, the generator cannot perform a complete conversion. For example, when turning dogs into cats, fine-tuning produces a new generator whose domain contains cats, dogs, and everything in between. The problem therefore reduces to in-domain latent editing: in particular, a StyleCLIP latent mapper is trained to map all latent codes into the cat-like regions of the latent space. Training details for the mapper are given in Appendix I.

5. Experiment

5.1 Results

The text-guided transformation is shown in the previous two figures. Please refer to Appendix G for more information.

The transitions from dogs to various animals are shown in the image above. The specific training details are shown in Appendix I.

5.2 Latent Space Exploration 

Modern image generators (especially StyleGAN) are known for their well-behaved latent spaces, which benefit tasks such as image editing and image-to-image translation. Editing real images has motivated a large number of GAN inversion techniques. The generators produced by our method support the same techniques; in fact, they can reuse existing models pre-trained on the source domain without additional fine-tuning.

GAN inversion . We begin by pairing existing inversion techniques with the shifted generator. Given a real image, it is inverted with a ReStyle encoder (pre-trained on the face domain), and the resulting latent codes $w \in W+$ are fed into the shifted generator. The results are shown in the figure above: the shifted generator successfully preserves the identity associated with the latent codes, even for codes obtained by inverting real images.

Latent traversal editing . The inversion results show that the latent space of the shifted generator remains aligned with that of the source generator. This is not surprising:

  • First, because of the coupled structure of the two generators and the nature of the guided loss.
  • Second, because previous work has observed that fine-tuned generators preserve latent-space alignment, which is useful for downstream applications.

However, this paper uses a different, non-adversarial approach, so it is worth verifying that the latent space indeed remains aligned. Using existing editing techniques, the authors show that latent-space directions do preserve their semantic meaning. Therefore, instead of finding new paths in the latent space of the shifted generator, one can simply reuse the paths and editing models established for the source generator.

As shown in the figure above, existing methods are used to edit real images mapped into the new domains: expressions and hairstyles with StyleCLIP, pose with StyleFlow, and age with InterFaceGAN, all using implementations pre-trained on the source domain.

Image-to-image translation . Richardson et al. demonstrate a wide range of image-to-image translation applications by training encoders that map images from arbitrary domains into the latent space of a pre-trained generator, which then re-synthesizes the image in its own domain. They demonstrate this approach on conditional synthesis, image restoration, and super-resolution tasks. However, the approach has a major limitation: images can only be generated in the domain on which the StyleGAN generator was trained.

These pre-trained encoders can be paired with the shifted generators for more general image-to-image translation. In particular, as shown above, conditional image synthesis across multiple domains is performed using segmentation-mask and sketch-based guidance, without training dedicated encoders for the new domains.

5.3 Comparison with other methods

Comparing StyleGAN-NADA with existing techniques shows:

  • First, our text-guided out-of-domain generation cannot be replicated by current latent editing techniques.
  • Second, StyleGAN-NADA outperforms current few-shot methods for large shape changes.

Text-guided editing . Existing editing techniques that operate within the domain of a pre-trained generator cannot transfer images to other domains.

A comparison with the three CLIP-guided editing techniques of StyleCLIP is shown above. All three StyleCLIP variants are incapable of out-of-domain manipulation, even for seemingly small changes (e.g., adding celebrity traits to dogs).

Few-shot generators . The authors compare StyleGAN-NADA with several few-shot methods: Ojha et al., MineGAN, TGAN, and TGAN+ADA. In all cases, the official StyleGAN-ADA AFHQ-Dog model is converted into a cat model. StyleGAN-NADA operates in a zero-shot manner; the other methods use 5, 10, or 100 images from AFHQ-Cat for training.

The comparison results are shown in the figure above. In comparison, StyleGAN-NADA has better performance.

A quality comparison is shown in the figure above; see Appendix H for more details.

Furthermore, the authors found that using StyleGAN-NADA as a pre-training step before few-shot adaptation improves synthesis performance in most cases; see Appendix C.

5.4 Ablation study

The experimental results are shown in the figure above. Across all domains and changes, the global loss does not produce good results, while the model with adaptive layer freezing and the directional (guided) loss gives the best results. In some cases, a latent mapper can be trained to further improve the results.

6 Conclusion

This paper presented StyleGAN-NADA, a CLIP-guided, zero-shot, non-adversarial domain adaptation method for image generators. Guiding the training of the generator with CLIP, rather than merely exploring its latent space, enables dramatic changes in style and shape far beyond the generator's original domain.

The method also has limitations. Because it builds on CLIP, it is bounded by CLIP's capabilities, and text-guided methods are inherently limited by the ambiguity of natural language.

The method performs well for changes in style and fine details, but struggles with large-scale structural and geometric changes; few-shot methods face the same problem.

The authors found that successful conversion requires a reasonably similar pre-trained generator as a starting point, and they focus on adapting existing generators. This raises the question of whether a generator could be trained from scratch using only CLIP guidance, removing this requirement. While difficult, recent advances in classifier inversion and generative art make it seem achievable.

References

Gal, R., Patashnik, O., Maron, H., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4), 1-13.

Appendix

A. Broader Impact

Because CLIP is trained using a large number of images collected from the web, the models supervised by it may be biased by these data. For example, when generating faces, if the text "doctor" is used to guide, most of the generated faces are male; if the text "nurse" is used to guide, most of the generated faces are female. In Appendix C, the authors address this problem using a small number of images.

B. Analysis of the CLIP space

By visualizing the behavior in the embedding space of CLIP, the authors analyze the difference between the guided CLIP loss and the traditional global distance minimization method.

We first embed images of AFHQ cats and dogs into the CLIP multimodal space and use PCA to project them to 2D. The same embedding and projection are applied to the texts "cat" and "dog", as well as to fake images synthesized by generators trained with the global loss and with the guided loss.

The experimental results are shown in the figure above. With the global loss (15.b), the optimization pushes everything toward a single target, which leads to a collapse into a single region of the embedding space, since there is no gain in maintaining a diverse distribution. In contrast, the guided loss (15.c) prevents this collapse and maintains a high degree of diversity.
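The analysis can be reproduced roughly as sketched below, reusing `clip_model` and `device` from the earlier sketches; the image batches are assumed to be already CLIP-preprocessed, and the exact image sets used in the paper's figure are not reproduced here.

```python
# A rough sketch of the analysis above: embed real images, generated images,
# and the two text prompts with CLIP, then project everything to 2D with PCA.
import numpy as np
from sklearn.decomposition import PCA

def embed_images(images):
    with torch.no_grad():
        feats = clip_model.encode_image(images)
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

def embed_texts(prompts):
    with torch.no_grad():
        feats = clip_model.encode_text(clip.tokenize(prompts).to(device))
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

def project_to_2d(cat_images, dog_images, fake_images):
    points = np.concatenate([embed_images(cat_images),
                             embed_images(dog_images),
                             embed_images(fake_images),
                             embed_texts(["cat", "dog"])])
    return PCA(n_components=2).fit_transform(points)   # rows keep their input order
```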

C. Few-shot CLIP guidance

While our approach focuses on zero-shot domain adaptation, similar ideas can be used for few-shot training. Here we investigate two methods for few-shot domain adaptation using semantic information from CLIP.

Image-based guidance . The first approach considers a scenario in which a small number of images (about 3-5) from the target domain are available. In this case, instead of a CLIP-space direction described by two texts, we use the CLIP-space direction between images produced by the source generator and the small set of real target-domain images. CLIP thus encodes the semantic difference between images of the two domains, and the guided loss uses the direction:

$$\Delta T = \frac{1}{N_r}\sum_{i=1}^{N_r} E_I(I_i) \;-\; \frac{1}{N_s}\sum_{j=1}^{N_s} E_I\big(G_{frozen}(w_j)\big),$$

where $N_r$ is the size of the real image set, $I_i$ is the i-th image in that set, and $N_s$ is the number of images sampled from the source-domain generator (set to 16 in this experiment). In other words, images rather than text serve as the guide for the conversion.
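A sketch of this image-based direction is shown below, reusing `clip_model` and `device` from earlier; it assumes the direction is the difference between the mean CLIP embedding of the real target images and the mean embedding of samples from the frozen source generator, with the rest of the guided loss unchanged.

```python
# A sketch of the image-based direction above: the mean CLIP embedding of the
# N_r real target images minus the mean embedding of N_s samples from the
# frozen source generator. Image batches are assumed CLIP-preprocessed.
def image_based_direction(real_images, G_frozen, mapping, n_source=16):
    with torch.no_grad():
        real_feats = clip_model.encode_image(real_images)            # N_r target images
        w = mapping(torch.randn(n_source, 512, device=device))
        source_feats = clip_model.encode_image(G_frozen(w))          # N_s source samples
    delta_t = real_feats.mean(dim=0) - source_feats.mean(dim=0)
    return delta_t / delta_t.norm()
```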

This image-based variant retains several advantages of the zero-shot approach over standard few-shot methods:

  • The structure of the latent space is better maintained, and the identity of subjects is better preserved (as shown in the figure above);
  • training takes less time;
  • the target-domain images need not be aligned or preprocessed in a way suited to the source domain.

Compared with the purely text-guided StyleGAN-NADA, this few-shot variant alleviates some of its limitations:

  • In particular, language ambiguity is avoided through some samples from the target domain.
  • Furthermore, it can generate styles that are difficult to describe in text. However, since CLIP's embedding space is semantic in nature, this method cannot guarantee precise translation to this style.
  • Finally, the method can handle the bias introduced by CLIP. For example, CLIP's tendency toward male doctors and female nurses can be avoided by providing images of doctors of both genders, as shown in the figure above.

Zero-shot pre-training . The second method considers a scenario where tens or hundreds of images are available. In this regime, few-shot models perform better; however, their performance relies on the similarity between the source generator and the target domain. Using the zero-shot method first can reduce the gap between the two domains and thereby improve performance.

Let $F(G, \{I_{real}\})$ denote a few-shot method that transfers a generator $G$ to a new domain using the image set $\{I_{real}\}$, and let $N$ denote the zero-shot method. Few-shot adaptation with zero-shot pre-training can then be expressed as:

$$G_T = F\big(G_{NADA}, \{I_{real}\}\big), \qquad G_{NADA} = N\big(G_S, t_{source}, t_{target}\big),$$

where $G_S$ and $G_T$ denote the source- and target-domain generators, respectively, and $G_{NADA}$ is the intermediate generator obtained by applying the zero-shot method to the source generator with text guidance $t_{source}$, $t_{target}$.

Typical few-shot adaptation methods require a well-trained discriminator for the source domain, whereas StyleGAN-NADA only changes the generator. Three possible ways to handle this generator-discriminator mismatch are therefore investigated.

  • First, simply ignore the mismatch and apply the few-shot method as usual, using $G_{NADA}$ together with the source discriminator $D_S$.
  • Second, let the discriminator "catch up": run a small number of initial training iterations in which only the discriminator is updated, with $G_{NADA}$ providing the fake data and the few-shot set serving as target-domain samples.
  • Third, use $G_{NADA}$ and the CLIP mapper to generate a large number of images, and use them to fine-tune $G_S$ and $D_S$ before using them as the source for few-shot training.

The three schemes are first compared on the 10-image dog-to-cat setting. In all cases, the same number of training iterations as the baseline model is used (including the iterations in which only the discriminator is trained). For the "fine-tuning G+D" setting, 25,000 images are generated and then used for 5,000 iterations to fine-tune the original dog model. The FID metric proposed by Ojha et al. is used (smaller is better): 5,000 images are sampled from the fine-tuned model and compared with the full target set (not just the few-shot set).

The result is shown in the figure above.

  • The results show that simply training the discriminator before training starts improves performance. The authors hypothesize that the initial synchronization of the discriminator helps focus on features that distinguish real cats from fakes, rather than features that distinguish cats from dogs.
  • For the case of fine-tuning, training with images generated by the CLIP mapper may reduce diversity and thus affect subsequent adaptation.
  • As shown in the figure above, MineGAN performs better when only the pre-trained generator scheme is used. This may be because it manages to identify the latent region where high-quality cat images are located and focuses the network's attention on that region.

After determining that discriminator "catch-up" could improve performance, the authors turned to evaluating the performance of their own pretrained models on additional domains and levels of supervision.

The result is shown in the figure above. All experiments employ catch-up.

In almost all cases, using the zero-shot pre-training yields better performance, in some cases improving it by 40%.

These results suggest that the method helps reduce the domain gap before few-shot adaptation, leading to improved performance.

D. Beyond StyleGAN

Beyond StyleGAN, the authors explore applying StyleGAN-NADA to OASIS (a SPADE-like model that synthesizes images from segmentation masks) to convert existing categories in a more local way. In this setting, a model pre-trained on the COCO-Stuff dataset is used, and one of its classes is transformed into a new class with the same shape but different appearance. To this end, the same training scheme and losses as in the main paper are used, with two changes:

  • First, all unspecified regions are masked out before the generated image is passed to CLIP, so that only the region of the class being changed contributes to the CLIP-space direction.
  • Second, changes are minimized in all regions outside the mask, and in all images that do not contain the specified class.

To this end, L2 and LPIPS losses are applied between the source and target generator outputs over these masked-out regions. Qualitative results are shown in the figure above; they show that the framework can be readily applied to other generative models. In this sense, StyleGAN-NADA is not a dedicated StyleGAN tool, but a general framework for training generative models without data.
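A possible shape for the masked objective described above (the masked CLIP direction plus the L2/LPIPS constraints) is sketched below, reusing `clip_model` from the earlier sketches. `class_mask` is assumed to be a binary mask of the class being changed, `delta_t` a precomputed, normalized CLIP-space text direction, and `lpips_fn` an LPIPS distance such as `lpips.LPIPS(net='vgg')`; loss weights and CLIP preprocessing are omitted for brevity.

```python
# A possible shape for the masked objective described above (assumption, not
# the official implementation). `class_mask` is a binary mask (1 = region of
# the class being changed); `delta_t` is a normalized CLIP text direction.
def masked_domain_loss(img_frozen, img_train, class_mask, delta_t, lpips_fn):
    # CLIP direction computed only on the target-class region.
    delta_i = (clip_model.encode_image(img_train * class_mask)
               - clip_model.encode_image(img_frozen * class_mask))
    delta_i = delta_i / delta_i.norm(dim=-1, keepdim=True)
    clip_loss = (1 - delta_i @ delta_t).mean()

    # Keep everything outside the mask unchanged (L2 + LPIPS).
    outside = 1 - class_mask
    l2_loss = ((img_train - img_frozen) * outside).pow(2).mean()
    lpips_loss = lpips_fn(img_train * outside, img_frozen * outside).mean()
    return clip_loss + l2_loss + lpips_loss
```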

E. Identity Preservation

The author gives additional examples to show that StyleGAN-NADA can preserve identity information across different domains.

The results of domain adaptation applied to synthetic and real images are shown in the figure above. StyleGAN-NADA successfully preserves the identity information of the source domain in the new domain, and even transforms accessories (e.g., hats, glasses) appropriately.

F. Cross-Model interpolation 

In addition to supporting interpolation and editing in the latent spaces of the new domains, StyleGAN-NADA also supports interpolating the model parameters between two domains. This further demonstrates the strong coupling of the StyleGAN-NADA latent spaces and is useful for other applications, e.g., generating images that transition smoothly across many domains. The image above shows such a transition.

G. More samples (Additional samples)

The author synthesized a large number of images.

The two figures above show transfers from the face domain.

The figure above shows transfers from the church domain.

The figure above shows transfers from the dog domain.

The figure above shows the transformation of dogs into other animals.

H. Qualitative Few-shot Comparison

The authors converted dogs to cats with StyleGAN-NADA and a few-shot model trained with AFHQ-Cat samples, respectively. The comparison results are shown in the figure above, and StyleGAN-NADA has better performance.

The authors use the few-shot setting of Ojha et al. to turn photos into sketches. The results are shown in the figure above: StyleGAN-NADA performs better and retains more details of the source domain. However, for these portraits it is difficult to design a text that precisely describes the style of the target domain. Using three images from the target domain (Appendix C) instead of text reduces the style gap, but does not eliminate it completely. Even when switching between "nearby" domains, most competing methods produce large artifacts, suffer severe mode collapse, or retain very little information from the source domain.

I. Training Details

Hyperparameter selection . The hyperparameter settings are shown in the figure above. Because training converges quickly, the number of iterations can be set arbitrarily high; the intermediate outputs of the model can then be inspected to decide when a good result has been reached.

StyleCLIP mapper . As discussed in Section 5, in some scenarios the StyleCLIP latent mapper can be used to identify the latent regions matching the target domain. Unfortunately, the mapper occasionally introduces unwanted artifacts into the image, for example opening an animal's mouth or elongating its tongue. The authors found that these artifacts correlate with growth in the norm of the generated image's CLIP-space embedding. To address this, an additional loss is introduced when training the mapper to limit these norms and thereby suppress the artifacts.

In this loss, $E_I$ is the CLIP image encoder, $G$ is the fine-tuned generator, $w$ is a sampled latent code, and $M$ is the latent mapper.
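A hedged sketch of such a norm-limiting regularizer is shown below; the exact formulation used in the paper may differ. It assumes a StyleCLIP-style mapper that predicts a residual offset added to the sampled code, and simply penalizes the norm of the resulting image's CLIP embedding (reusing `clip_model` from earlier).

```python
# A hedged sketch of a norm-limiting regularizer for the mapper; the exact
# formulation in the paper may differ. It assumes the mapper M predicts a
# residual offset added to the sampled code w (as in StyleCLIP) and penalizes
# the norm of the resulting image's CLIP embedding.
def embedding_norm_loss(G, M, w):
    mapped_image = G(w + M(w))          # generator applied to the mapped code
    return clip_model.encode_image(mapped_image).norm(dim=-1).mean()
```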

J. Licenses and data privacy 

The models used in this article, the training data sets and their respective licenses are shown in the above two figures. 
