(2022, Entity Transfer) Generalized one-shot domain adaptation for GANs

Generalized One-shot Domain Adaptation of Generative Adversarial Networks

Official account: EDPJ

Table of contents

0. Summary

1 Introduction

2. Related work

3. Basics

4. Method

4.1 Overview

4.2 Style Fixation and Example Reconstruction

4.3 Internal distribution learning 

4.4 Manifold Regularization 

5. Experiment 

5.1 Comparison with SOTA methods

5.2 Ablation studies 

5.3 Further results 

6. Conclusions and limitations 

References

A. Appendices

A.1 The effect of introducing entity masks

A.2 Comparison of different distribution matching losses

A.4 Auxiliary Network

A.5 Failure Cases

A.6 Mask-Guided Transfer

A.8 Hyperparameter selection

A.8.4 Can the auxiliary network be placed elsewhere?

S. Summary

S.1 Core idea

S.2 Network Architecture

S.3 Losses


0. Summary

Domain adaptation of generative adversarial networks (GANs) aims to transfer pretrained GANs to target domains with limited training data. In this paper, we focus on the one-shot case, which is more challenging and rarely explored in previous work. We argue that the adaptation from the source domain to the target domain can be divided into two parts: the transfer of global styles such as texture and color, and the emergence of new entities that do not belong to the source domain. While previous work has mainly focused on style transfer, we propose a novel and concise framework to address the generalized one-shot adaptation task of style and entity transfer, where a reference image and its binary entity mask are provided. Our core idea is to constrain the gap between the reference and synthetic internal distributions with the sliced Wasserstein distance. To achieve this, we first use style fixation to roughly obtain the exemplar style, and then introduce an auxiliary network into the generator to decouple entity and style transfer. Furthermore, to achieve cross-domain correspondence, we propose a variational Laplacian regularization to constrain the smoothness of the adapted generator. Both quantitative and qualitative experiments demonstrate the effectiveness of our method in various scenarios.

1 Introduction

We believe there are limitations in previous task settings, because an exemplar contains more information than its art style (color and texture). As shown in Figure 1, the source domain contains natural human faces, while the exemplars of the target domain are artistic portraits with some entities (e.g., hats and accessories). Previous work only focuses on the transfer of artistic styles and ignores the entities. This causes two problems:

  • In some cases, entities are an important part of the domain characteristics and should be transferred along with the style.
  • Large entities, such as the mask in Figure 1, can easily bias the style extraction of the face region and cause unintended artifacts in the adapted synthesis.

Therefore, we refine the task setting with the help of an additional binary mask, which marks the entities of interest and helps to define the target domain more accurately and flexibly. We name the new task generalized one-shot GAN adaptation. The scenario in previous work can be seen as a special case in which the binary mask is all zeros. We believe that the exploration of the new task is meaningful, not only for artistic creation in more general and complex scenes, but also for further discovering and exploiting the knowledge in pre-trained models.

To address new adaptation problems, we propose a concise and effective adaptation framework for StyleGAN.

  • First, we modify the architecture of the original generator to decouple the adaptation of styles and entities, where the new generator contains an additional auxiliary network to facilitate the generation of entities, while the original generator is dedicated to generating stylized images.
  • Second, unlike previous works that use CLIP similarity or a GAN loss to learn domain knowledge, our framework directly minimizes the difference between the internal distributions of the exemplar and the synthesized images via the sliced Wasserstein distance for both style and entity transfer. Combined with style fixation, all synthesized images are associated with the exemplar style through a transformation in the latent space.
  • Third, inspired by cross-domain consistency and classical manifold learning, we propose a variational Laplacian regularization that smooths the change of the network before and after adaptation, preserving the geometry of the source generated manifold and preventing the content from being distorted during training.

We conduct extensive experiments on various references with and without entities. The results show that our framework can fully exploit cross-domain correspondences and achieve high transfer quality. Furthermore, adaptive models can easily perform various image processing tasks. 

2. Related work

GAN adaptation leverages the knowledge stored in pre-trained GANs to alleviate data scarcity and speed up training on new domains. The earliest work to study and evaluate different adaptation strategies showed that transferring a pretrained generator and discriminator together improves both convergence time and the quality of the generated images. Since then, an increasing number of adaptation strategies have been proposed for various GAN architectures. As StyleGAN and its variants continue to make breakthroughs in generating various types of data, designing adaptation strategies for StyleGAN has gained widespread and continuing attention. For example, zero-shot GAN adaptation transfers StyleGAN to a target domain defined by text prompts with the help of CLIP. There are also studies on transferring StyleGAN to a domain defined by a few given images.

Among the above works, ours is most closely related to FSGA, which proposes a cross-domain consistency (CDC) loss to preserve the diversity of the source generator and to make the transferred images correspond to the source samples. Although the CDC loss becomes ineffective for FSGA when only one training image is available, its mechanism inspires us to consider a more effective regularization to maintain the correspondence, which is discussed in Section 4.4.

Recently,

  • Some have explored one-shot adaptation tasks.
  • Some have adaptively aligned sample and reference images in the CLIP feature space.
  • Some learn style mappers by constructing massive pairing datasets.
  • Some introduce additional latent mappers and classifiers into the original GAN.
  • In contrast, we focus on style and entity adaptation, which has never been studied before. And we demonstrate that aligning the internal distribution with appropriate regularization is effective for GAN adaptation with style and entity decoupling.

3. Basics

StyleGAN. A typical StyleGAN can be divided into two parts. First, given the noise distribution P(z), the mapping network transforms the noise from the space Z to the latent space W. Second, the synthesis network G consists of convolutional blocks that map the W space to the image space. Another important concept is the extended latent space W+, in which each convolutional block receives its own latent vector instead of a shared one. Empirically, the W+ space is inferior to W in terms of generative fidelity and editability, but better for GAN inversion, i.e., finding the code w_ref that reconstructs the reference image y_ref. In this article, we focus only on tuning G.

Manifold learning. Consider a matrix X of samples from a source manifold M_s, and let W be the weight (affinity) matrix whose entry w_(i,j) weights samples x_i and x_j, usually defined by the heat kernel w_(i,j) = exp(−‖x_i − x_j‖² / (2σ²)).

Given a task-specific function f with y = f(x), manifold learning requires that the transformed samples Y on a new manifold M_t preserve the geometry of M_s: data that are close on M_s should also be close on M_t, and vice versa. This requirement is mainly enforced by minimizing the manifold regularization term

∫_{M_s} ‖∇_{M_s} f(x)‖² dP(x),

where ∇_{M_s} f is the gradient of f along the manifold and P(x) is the probability measure. If M_t is a submanifold, then by Stokes' theorem the formula above is equivalent to

∫_{M_s} ⟨f(x), Δ_{M_s} f(x)⟩ dP(x),

where Δ_{M_s} is the Laplace–Beltrami operator, the key geometric object on a Riemannian manifold; the integral is therefore also called Laplacian regularization. In practice, it is estimated by the following discrete form:

(1/2) Σ_{i,j} w_(i,j) ‖y_i − y_j‖² = tr(Yᵀ L Y),

where L = D − W is the graph Laplacian and D is the degree matrix, whose off-diagonal elements are zero and whose diagonal elements are d_(i,i) = Σ_j w_(i,j).

Intuitively, the regularization promotes the local smoothness of f: if the data are densely distributed in a region, the function should be smooth there to avoid local oscillations. In the discrete form, if x_i and x_j are close, w_(i,j) imposes a strong penalty on the distance between y_i and y_j.
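To make the discrete form concrete, here is a minimal PyTorch sketch of such a Laplacian regularizer; the heat-kernel weights and the function names are our own illustrative choices, not code from the paper.

```python
import torch

def heat_kernel_weights(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) from pairwise distances."""
    d2 = torch.cdist(x, x) ** 2                 # (n, n) squared distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def laplacian_reg(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Discrete Laplacian regularizer: sum_ij w_ij * ||y_i - y_j||^2 (= 2 * tr(Y^T L Y))."""
    w = heat_kernel_weights(x, sigma)           # neighborhood structure on the source manifold
    d2y = torch.cdist(y, y) ** 2                # distances between transformed samples
    return (w * d2y).sum()

# toy usage: y = f(x) should keep samples that are close on M_s close on M_t
x = torch.randn(8, 16)
y = x @ torch.randn(16, 4)
print(laplacian_reg(x, y).item())
```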

Task definition. Generalized one-shot adaptation can be formally described as follows: given a reference y_ref and a binary mask m_ref marking the locations of the entities contained in y_ref, the target domain T is defined to contain images with styles and entities similar to y_ref. Using the knowledge stored in a generative model G_s pre-trained on the source domain S, a generative model G_t is learned from y_ref and m_ref to generate diverse images belonging to domain T. Note that we can set m_ref = 0 when no entity of interest exists in y_ref. Furthermore, the adaptation should satisfy cross-domain correspondence, i.e., for any latent code w, G_s(w) and G_t(w) should have visually similar shapes or contents. With this correspondence, the adapted model can be used not only for synthesis but also for the transfer and manipulation of source images.

4. Method

4.1 Overview

The framework for the general one-shot GAN adaptation task is shown in Fig. 2. In the next sections, each component of the framework is described in detail by taking StyleGAN pretrained on FFHQ as an example. 

Network architecture. Unlike previous work that directly adopts the source generator G_s, our target generator G_t (shown in Figure 2) consists of a generator ^G_t inherited from G_s and an auxiliary network (aux) trained from scratch. We are motivated by the fact that G_s can approximate arbitrary real faces but cannot handle most entities, since G_s captures the universal elements of FFHQ (clean natural faces) rather than the tails of the distribution (faces with entities such as hats or ornaments). This design lets aux serve the entities while ^G_t focuses on stylizing faces, benefiting from the prior knowledge stored in G_s. As shown in Fig. 2(d), for each latent code w, the feature map generated by a convolutional block of StyleGAN is fed to aux, where a UNet predicts the entity feature map f_ent and a mask m that marks the entity location; they are then merged with f_in by Equation 1 to obtain a face feature map overlaid with the entity:

f_out = m ⊙ f_ent + (1 − m) ⊙ f_in,   (1)

where ⊙ denotes the Hadamard product. We want the shape of the entity to be affected only by the synthesized content, so we place aux after the fourth block of StyleGAN, whose 32×32 feature map f_in contains the most content information and the least style information (see Section 4.2). Because global styles and local entities are usually independent, this decoupled design also prevents the face style from being polluted by the entity style (see Section 4.3).
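As a rough sketch of how such a merge could look in PyTorch (the UNet is replaced by simple stand-in convolutions, and all names are illustrative rather than the official implementation):

```python
import torch
import torch.nn as nn

class Aux(nn.Module):
    """Stand-in for the UNet-based aux: predicts entity features and a soft mask."""
    def __init__(self, channels: int = 512):
        super().__init__()
        self.ent_head = nn.Conv2d(channels, channels, 3, padding=1)
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, f_in: torch.Tensor):
        f_ent = self.ent_head(f_in)            # entity feature map
        m = self.mask_head(f_in)               # soft entity mask in [0, 1]
        f_out = m * f_ent + (1.0 - m) * f_in   # Equation 1: overlay the entity onto the face features
        return f_out, m

f_in = torch.randn(1, 512, 32, 32)             # 32x32 feature map after StyleGAN's fourth block
f_out, m = Aux()(f_in)
print(f_out.shape, m.shape)
```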

Training process. Our training process can be broken down into three main parts:

1) Style fixation and exemplar reconstruction. In Figure 2(a), we obtain the latent code w_ref of y_ref through GAN inversion and train G_t to reconstruct y_ref with the reconstruction loss L_rec. In Figure 2(b), each latent code w is transformed into a style-fixed code w# that roughly carries the exemplar style.

2) Internal distribution learning. We minimize the differences between the internal patch distributions of the syntheses and the exemplar for style and entity transfer. Instead of the hard-to-optimize GAN loss, we employ the sliced Wasserstein distance (SWD) as the style loss L_style and the entity loss L_ent to learn the internal distributions efficiently.

3) Manifold regularization. To suppress content distortion during training, we propose a variational Laplacian regularization L_VLapR that smooths the change from G_s to G_t by preserving the geometric structure of the source manifold. As shown in Figure 2(c), the overall loss (Equation 3) combines L_rec, L_style, L_ent, and L_VLapR.

The details of the training process are illustrated and discussed in the following sections. For brevity and self-consistency, some abbreviations are used in what follows.

4.2 Style Fixation and Example Reconstruction

As is well known, StyleGAN's latent spaces W and W+ are disentangled: the later vectors of a latent code mainly determine the style of the synthesis, while the earlier vectors determine its rough structure or content. Taking advantage of this property, we first roughly transfer the exemplar style to other syntheses via style-fixed codes. The exemplar y_ref is fed into the pre-trained GAN inversion encoder e4e to obtain the latent code

w_ref = e4e(y_ref) = [w_ref^(1:l), w_ref^(l+1:18)],

whose two parts encode the content and style information of the exemplar, respectively. Each latent code w, before being fed into the network, is transformed into

w# = [w^(1:l), w_ref^(l+1:18)].

The content part of w is preserved while the style part is replaced with that of the exemplar; we denote the new space as W#. Style fixation should not change the content of the original synthesis. Using a pre-trained ArcFace model to compare identity similarity before and after style fixation, we found that l = 8 is acceptable: it yields an average similarity of 65% while visually capturing enough of the exemplar style. In MTG, style mixing is used as a post-processing step to improve style quality, but experiments (Fig. 3) show that the resulting styles are under-fitted and inconsistent. In our framework, style fixation is instead a pre-processing step that facilitates subsequent learning.
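A minimal sketch of style fixation, assuming a StyleGAN2 W+ code of 18 × 512 vectors and the split index l = 8 mentioned above (function names are ours):

```python
import torch

def style_fixation(w: torch.Tensor, w_ref: torch.Tensor, l: int = 8) -> torch.Tensor:
    """w, w_ref: (batch, 18, 512) codes in W+; returns codes in the style-fixed space W#."""
    w_sharp = w.clone()
    w_sharp[:, l:] = w_ref[:, l:]    # style part is taken from the exemplar code
    return w_sharp                   # content part (first l vectors) is preserved

w = torch.randn(4, 18, 512)
w_ref = torch.randn(1, 18, 512).expand(4, -1, -1)
w_sharp = style_fixation(w, w_ref)
print(w_sharp.shape)
```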

Because x_rec is just the projection of y_ref onto the source domain, the two often differ visibly, as shown in Figure 2. We therefore adopt the reconstruction loss L_rec (Equation 4) to narrow the gap; it combines a negative structural similarity term, a perceptual term, and a mask term,

where dssim denotes the negative structural similarity measure, lpips the perceptual loss, and m_rec the upsampled mask reconstructed by aux. It is worth noting that L_rec acts on G_t rather than on the original architecture ^G_t. When m_ref = 0, aux has no effect and G_t equals ^G_t.
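A hedged sketch of how such a reconstruction loss could be assembled; the equal weighting of the three terms and the helper names (ssim_fn, lpips_fn) are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def rec_loss(y_hat, y_ref, m_rec, m_ref, ssim_fn, lpips_fn):
    """y_hat = G_t(w_ref); ssim_fn / lpips_fn stand for pretrained SSIM and LPIPS modules."""
    l_dssim = 1.0 - ssim_fn(y_hat, y_ref)      # negative structural similarity (dssim)
    l_lpips = lpips_fn(y_hat, y_ref).mean()    # perceptual loss
    l_mask = F.mse_loss(m_rec, m_ref)          # supervise the upsampled mask from aux
    return l_dssim + l_lpips + l_mask          # equal weights are an assumption

# dummy stand-ins so the sketch runs end to end
ssim = lambda a, b: torch.tensor(0.8)
lpips = lambda a, b: (a - b).abs().mean(dim=[1, 2, 3])
y_hat, y_ref = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
m_rec, m_ref = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
print(rec_loss(y_hat, y_ref, m_rec, m_ref, ssim, lpips).item())
```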

4.3 Internal distribution learning 

Recent work on internal learning demonstrates that the internal patch distribution of a single image carries rich information; minimizing the difference between internal distributions can solve many tasks such as image generation, style transfer, and super-resolution. This is usually implemented with an adversarial patch discriminator. However, GAN training is known to be time-consuming, unstable, and memory-hungry, and previous few-shot GAN adaptations that use a GAN loss almost always suffer from severe model collapse. We find that the sliced Wasserstein distance (SWD) reaches the same goal more efficiently. For two feature tensors X and Y whose pixels are treated as d-channel samples, the SWD between their internal distributions is defined as (Equation 5)

SWD(X, Y) = E_θ ‖sort(proj(X, θ)) − sort(proj(Y, θ))‖²,

where proj projects each pixel from d channels to a scalar along the direction θ, and sort is an operator that sorts all projected values. SWD can be implemented simply with random 1×1 convolutions and a quicksort.
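A minimal PyTorch sketch of SWD over internal pixel distributions, using random unit projections and sorting as described above; the names and the 256 projections (following the Monte Carlo setting of Section 5) are our choices, not the official code:

```python
import torch

def swd(x: torch.Tensor, y: torch.Tensor, n_proj: int = 256) -> torch.Tensor:
    """x, y: feature maps of shape (B, C, H, W); every pixel is treated as a C-dim sample."""
    c = x.shape[1]
    theta = torch.randn(n_proj, c, device=x.device)
    theta = theta / theta.norm(dim=1, keepdim=True)          # random directions on the unit sphere
    # project every pixel onto each direction (equivalent to a random 1x1 convolution)
    px = torch.einsum('bchw,pc->bphw', x, theta).flatten(2)  # (B, P, H*W)
    py = torch.einsum('bchw,pc->bphw', y, theta).flatten(2)
    px, _ = torch.sort(px, dim=-1)                           # sort the projected values
    py, _ = torch.sort(py, dim=-1)
    return ((px - py) ** 2).mean()

x, y = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(swd(x, y).item())
```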

To achieve style and entity transfer, we use SWD to minimize the difference between the internal distributions of the synthesis and the reference. The style loss (Equation 6) takes the form

L_style = Σ_{φ ∈ Φ_style} SWD(φ(G_t(w#)), φ(^y_rec)),

where Φ_style is a set of convolutional layers from the pretrained lpips network used to extract spatial features. Note that the generator learns the style of ^y_rec instead of y_ref, to prevent the entity's style from leaking into other regions. The entity loss (Equation 7) takes the form

L_ent = Σ_{φ ∈ Φ_ent} SWD(φ(G_t(w#) ⊙ m↑), φ(y_rec ⊙ m_rec↑)),

where Φ_ent is also a set of convolutional layers from the pretrained lpips network, and m↑ is the mask m upsampled to the resolution of the final synthesis. Taking the reconstructed entity rather than the reference entity as the target can be seen as an implicit data-augmentation technique that avoids overfitting.
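A hedged sketch of the two losses, reusing the swd helper from the previous sketch; vgg_features stands for the lpips/VGG layer set Φ, and the exact masking scheme is our assumption rather than the paper's definition:

```python
import torch
import torch.nn.functional as F

def style_loss(x_syn, y_rec, vgg_features, swd):
    # learn the style of the entity-free reconstruction ^y_rec, not of y_ref itself
    return sum(swd(fx, fy) for fx, fy in zip(vgg_features(x_syn), vgg_features(y_rec)))

def entity_loss(x_syn, m_syn, y_rec, m_rec, vgg_features, swd):
    # compare patch statistics only inside the (resized) entity masks
    fx = [f * F.interpolate(m_syn, size=f.shape[-2:]) for f in vgg_features(x_syn)]
    fy = [f * F.interpolate(m_rec, size=f.shape[-2:]) for f in vgg_features(y_rec)]
    return sum(swd(a, b) for a, b in zip(fx, fy))

# toy usage with dummy feature extractor and a simplified SWD stand-in
feats = lambda img: [F.avg_pool2d(img, 2), F.avg_pool2d(img, 4)]
dummy_swd = lambda a, b: ((a.flatten(2).sort(-1).values - b.flatten(2).sort(-1).values) ** 2).mean()
x, y = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
mx, my = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(style_loss(x, y, feats, dummy_swd).item(), entity_loss(x, mx, y, my, feats, dummy_swd).item())
```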

4.4 Manifold Regularization 

The training loss, especially the style loss, inevitably distorts the content of the source syntheses and changes their relative relationships. To achieve cross-domain correspondence, we use a variational Laplacian regularization L_VLapR to mitigate the distortion. Since we do not tune the StyleGAN mapping network, the W space combined with the source generator G_s does not change, so we define (Equation 8)

L_VLapR = ∫_W ‖∇( Φ(G_t(w#)) − Φ(G_s(w#)) )‖² dP(w),

where Φ is a pre-trained semantic extractor; here we use a pre-trained CLIP image encoder, which is sensitive to both style and content variations. L_VLapR can be interpreted from two perspectives. The first is the smoothness of the function: since the integrand involves the residual f(w) = Φ(G_t(w#)) − Φ(G_s(w#)), minimizing L_VLapR encourages the syntheses of G_s and G_t to have smoothly varying semantic differences across the latent space W. In the ideal case where the integrand is nearly zero, G_t(w#) and G_s(w#) exhibit similar differences in style and entity for the various latent codes w in a local region. The second perspective is the preservation of geometric structure: according to Laplacian regularization theory, the integral form in Equation 8 can be estimated by the discrete form

Σ_{i,j} w_(i,j) ‖f(w_i) − f(w_j)‖²  (9a), which equals 2 tr(Fᵀ L F)  (9b),

where L is the Laplacian matrix (see Section 3) and F stacks the residuals f(w_i). In Equation 9a, if w_i and w_j are close in the latent space, w_(i,j) is a large scalar that pushes f(w_i) and f(w_j) to be nearly the same. This means that the relative distances between neighboring syntheses before and after adaptation are approximately preserved in the feature space. In practice, the loss can be computed efficiently using Equation 9b; the derivation is provided in the supplementary material.
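A hedged sketch of the discrete estimate of L_VLapR over a small batch of latent codes; clip_encode, G_s, and G_t are stand-ins for the CLIP image encoder and the two generators, and the heat-kernel weighting is an assumption consistent with Section 3:

```python
import torch

def vlapr(w_batch, G_s, G_t, clip_encode, sigma: float = 1.0) -> torch.Tensor:
    """w_batch: (n, ...) style-fixed latent codes; works even with a batch size of 2."""
    f = clip_encode(G_t(w_batch)) - clip_encode(G_s(w_batch))    # residual Phi(G_t) - Phi(G_s), (n, d)
    d2w = torch.cdist(w_batch.flatten(1), w_batch.flatten(1)) ** 2
    w_ij = torch.exp(-d2w / (2 * sigma ** 2))                    # affinity of latent codes (Section 3)
    d2f = torch.cdist(f, f) ** 2
    return (w_ij * d2f).sum()                                    # sum_ij w_ij ||f_i - f_j||^2

# toy usage with stand-in generators and encoder
G_s = lambda w: w @ torch.randn(512, 64)
G_t = lambda w: w @ torch.randn(512, 64)
clip_encode = lambda img: img[:, :16]
w_batch = torch.randn(2, 512)
print(vlapr(w_batch, G_s, G_t, clip_encode).item())
```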

Relationship between L_VLapR and L_CDC. The recent influential work FSGA introduces a cross-domain distance consistency loss L_CDC to achieve cross-domain correspondence. For any w_i and w_j, FSGA defines a conditional (softmax) similarity distribution over a batch of latent codes for each generator (Equation 10) and then minimizes the Kullback–Leibler divergence between the two distributions (Equation 11), encouraging the syntheses of G_t to keep the same similarities as those of G_s.

In these equations, Φ denotes a feature map of the corresponding generator. From a manifold-learning perspective, the two distributions describe the neighborhood structures of x_i and y_i respectively, and L_CDC tries to place y_i in the new space so as to optimally preserve the neighborhood identity of x_i. In fact, we find that this idea is equivalent to stochastic neighbor embedding [12, 40], which is also a popular method for preserving manifold structure.
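For comparison, a rough sketch of a CDC-style loss as we read Equations 10–11; the cosine similarity and the feature choice are assumptions about FSGA, not part of our method:

```python
import torch
import torch.nn.functional as F

def cdc_loss(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """feat_s, feat_t: (n, d) features of the same latent batch from G_s and G_t."""
    def neighbor_dist(f):
        sim = F.cosine_similarity(f.unsqueeze(1), f.unsqueeze(0), dim=-1)  # (n, n) similarities
        sim.fill_diagonal_(-1e4)                       # exclude self-similarity
        return F.log_softmax(sim, dim=-1)
    log_p_s, log_p_t = neighbor_dist(feat_s), neighbor_dist(feat_t)
    # KL(p_t || p_s), averaged over anchor samples i
    return F.kl_div(log_p_s, log_p_t.exp(), reduction='batchmean')

feat_s, feat_t = torch.randn(8, 128), torch.randn(8, 128)
print(cdc_loss(feat_s, feat_t).item())
```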

L_CDC has the following disadvantages:

First, it requires large-batch estimates of the similarity distributions, which is almost unacceptable for a high-resolution GAN. L_VLapR directly constrains the distances and performs well even with a batch size of 2.

Second, L_CDC is non-convex and difficult to optimize. In practice, its value is usually around 1e-5, which means it does not provide an effective penalty on the deviation.

Finally, because of its softmax form, L_CDC is scale-invariant to its input; unlike L_VLapR, it cannot guarantee an isometric relationship between the source images and the adapted syntheses, so it can hardly avoid mode collapse. In the supplementary material, we show that replacing L_CDC with L_VLapR greatly alleviates the collapse of FSGA.

5. Experiment 

Our implementation is based on the official StyleGAN code. We use a pretrained lpips network with a VGG architecture containing five convolutional blocks. We compute the expectation in Equation 5 by Monte Carlo estimation, randomly sampling 256 vectors on the unit sphere. The only data augmentation is horizontal flipping. More experimental details and results can be found in the supplementary material.

5.1 Comparison with SOTA methods

We compare with the few-shot adaptation method FSGA, the one-shot adaptation methods MTG and OSCLIP, and the one-shot face style transfer method JoJoGAN on the common face domain. As shown in Figure 3, since these works cannot yet handle entities, for a fair comparison we select three portraits without entities (Sketch, Disney, Arcane) and three portraits with entities from AAHQ (hats, Zelda ornaments, and masks). Their masks can be obtained by manual annotation or by semantic segmentation.

Qualitative results. Figure 3 shows the qualitative results. The top-row images are source natural faces, which are inverted by e4e to obtain latent codes. From the results, we can draw the following conclusions:

  • First, our syntheses are competitive with the given reference. Although displayed as thumbnails, they have more pronounced high-frequency detail and look sharp and realistic, whereas the JoJoGAN results are very blurry.
  • Second, our method achieves the strongest cross-domain correspondence, preserving the content and shape of the source image, which is undoubtedly a great improvement compared with other methods. This also means that our method can maintain the diversity of source models without causing content collapse.
  • Third, our method can generate high-quality entities that can adapt to various face shapes and look harmonious with the rest of the synthesis.

Experiments demonstrate that our method has sufficient visual advantages over competitors.

Quantitative results. Previous work mainly conducts user studies to compare visual quality. Following them, we survey users' preferences for the different methods. We first randomly sample 50 latent codes in the W space and feed them into the various models, then show volunteers the reference, the source syntheses, and the adapted syntheses; considering both style and content, they vote for their favorite adapted syntheses. Furthermore, to objectively evaluate cross-domain correspondence, we use a face-alignment network to extract facial landmarks and compute the normalized mean error (NME) of the landmarks between the source synthesis and the adapted synthesis,

where l_mk is a 68-point 2D landmark extractor and the number of sampled latents n is set to 1000. NME reflects the change in face shape before and after adaptation. We further report the identity similarity (IS) predicted by ArcFace to measure the preservation of identity information. Models with lower NME and higher IS show better cross-domain correspondence.
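A hedged sketch of the NME computation; l_mk is assumed to return 68 × 2 landmark arrays, and the normalizer (the image size here) is our assumption, since this summary does not specify it:

```python
import numpy as np

def nme(landmarks_src, landmarks_adapt, image_size: int = 1024) -> float:
    """landmarks_*: arrays of shape (n, 68, 2) extracted from G_s(w_i) and G_t(w_i)."""
    err = np.linalg.norm(landmarks_src - landmarks_adapt, axis=-1)  # (n, 68) point-wise distances
    return float(err.mean() / image_size)                           # normalized mean error

# toy usage with synthetic landmarks for n = 1000 latents
src = np.random.rand(1000, 68, 2) * 1024
adapt = src + np.random.randn(1000, 68, 2)
print(nme(src, adapt))
```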

The quantitative results are shown in Table 1. Our method clearly achieves the best scores on NME and IS, with stable performance and similar NMEs across different domains. FSGA and OSCLIP achieve relatively poor scores due to overfitting, as shown in Figure 3; their common feature is the use of a GAN discriminator for training. MTG is very unstable, especially when the target domain has a strongly semantic style, such as Disney and Zelda. Compared with our method, JoJoGAN performs well on NME but not on IS. For the user study, we collected valid votes from 53 volunteers, who spent an average of 4 seconds per question. Our method receives around 60% of the votes for each adapted model, while the other methods share the rest roughly evenly. This indicates that our method is preferred by users.

Training and inference time. Our method takes about 12 minutes for m_ref ≠ 0 and 3 minutes for m_ref = 0 on an NVIDIA RTX 3090. FSGA, MTG, JoJoGAN, and OSCLIP take about 48, 9, 2, and 24 minutes, respectively. All methods take about 30 milliseconds to generate an image.

5.2 Ablation studies 

We start with the basic reconstruction loss L_rec and progressively add components to the framework to observe the qualitative changes in the syntheses.

  • As shown in Figure 4, the target sample is a hard case depicting a man with a beard and a hat, and style fixation can bring crude but significant improvements to the synthesized style.
  • While L_style can enhance the style of the face, it can degrade hair and introduce noise like beards on female faces.
  • L_ent plays a key role in the generation of entities; without it, the hat cannot be generated.
  • In particular, L_VLapR has a significant suppression effect on the noise in the previous synthesis.
  • Finally, removing L_style or style fixation causes the synthesis to lose the target style, which means both play a vital role in style transfer.

We also provide a quantitative ablation in Table 1 to study the effect of L_style and L_VLapR. It shows that removing L_style has a positive effect on NME and IS, while removing L_VLapR has the opposite effect. The quantitative results confirm that L_VLapR helps prevent the content from being distorted by L_style.

5.3 Further results 

Adaptation to various domains. In addition to the success on the face domain, our framework can also be applied to other source domains. We adapt StyleGAN models pretrained on the 512×512 AFHQ dog and cat datasets and the 256×256 LSUN Church dataset. As shown in Figure 5, for each source domain we provide references with and without entities, and the source images are randomly sampled from the source space. For references without entities, the adapted syntheses are photorealistic and show strong visual correspondence. In the other cases, our method also produces satisfactory results; for example, the dog's cartoon eyes are positioned correctly and closely resemble the eyes of the reference. This demonstrates that our framework is effective and generalizes widely.

Image processing. Because our adapted model preserves the geometry of the source generative space, it can perform semantic editing using semantic directions found in the source space. As shown in Figure 6, since our model accurately reconstructs the target exemplar, we can edit both the exemplar and the syntheses. We choose three representative semantic directions that vary smile, age, and pose rotation, and the results show that the adapted generator achieves precise control not only over faces but also over entities. In addition, our model can change the style by changing the style code, which is difficult for previous works. Our framework can also remove entities by using ^G_t alone and manipulate the results in the same way. We believe these features will help artistic creation.
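As an illustration of such editing, a minimal sketch of moving a latent code along a semantic direction before feeding it to the adapted generator; the direction here is only a placeholder for one found by the usual latent-space analysis methods:

```python
import torch

def edit(w: torch.Tensor, direction: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Move the latent code along a semantic direction; feed the result to G_t (or ^G_t)."""
    return w + alpha * direction

w = torch.randn(1, 512)                  # a latent code in W
smile_direction = torch.randn(1, 512)    # placeholder for a learned semantic direction
w_smile = edit(w, smile_direction)       # then synthesize with the adapted generator
```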

6. Conclusions and limitations 

In this paper, we build a new adaptation framework for the generalized one-shot GAN adaptation task. Because our framework relies heavily on learning internal distributions, it inevitably has some limitations. The most important one is that it cannot precisely control the position of the entity, which may cause failures when the pose changes too much. Another limitation is that entities cannot be too complex, otherwise they are hard to learn through the patch distribution. These edge cases require additional consideration. Overall, our framework is effective in various scenarios, and we believe that the exploration of the new task and the introduction of manifold regularization are valuable for future work.

References

Zhang Z, Liu Y, Han C, et al. Generalized One-shot Domain Adaptation of Generative Adversarial Networks. arXiv preprint arXiv:2209.03665, 2022.

A. Appendices

A.1 The effect of introducing entity masks

Introducing an additional entity mask is significant for the one-shot adaptation task; to illustrate this, we adapt GANs using the same reference with and without the mask.

  • The results in Figure 1(a) demonstrate that the mask helps to define the target domain clearly. Without the mask, the synthesis acquires only the exemplar style; with it, the synthesis acquires both the entity and the style.
  • Figure 1(b) shows that without the mask, the hair is polluted by the color and texture of the Santa hat, so the mask prevents the entity's style from negatively affecting other regions.
  • In Figure 1(c), the mask allows the model to focus more on the objects of interest; note that the right eye is closer to the reference than the left eye.

A.2 Comparison of different distribution matching losses

Both style transfer and entity generation can be explained as learning the internal distribution of the exemplar, so we compare SWD with the commonly used Gram loss and moment-matching loss to demonstrate the superiority of SWD.

  • Although a patch GAN loss is a theoretical alternative and the most common choice for image translation, the GAN framework is heavy (about 50 minutes of training time and large GPU memory usage) and prone to severe overfitting on a single target image.
  • In theory, SWD can fully capture a distribution: for two distributions p and q, SWD(p, q) = 0 ⇔ p = q.
  • However, a vanishing Gram loss only means that p and q have the same expectation (i.e., first moment).
  • Likewise, a vanishing moment-matching loss only means that p and q share the same higher-order central moments.

Therefore, SWD, the Gram loss, and the moment-matching loss are all theoretically feasible for style adaptation. However, a successful generative model needs to learn the precise distribution, so only SWD can be used for entity adaptation.

A.4 Auxiliary Network

Our auxiliary network adopts the UNet architecture shown in Figure 4. It receives the feature map f_in from StyleGAN and outputs the entity feature map f_ent and the mask m. They are then merged by Equation 1 of the main paper. Note that the predicted mask m is at a resolution of 256×256 instead of 32×32 to label entities more finely; we do not emphasize this in the main paper in order to highlight our core ideas.

A.5 Failure Cases

As mentioned in the conclusion, our method can fail in two main situations. As shown in Figure 5, the first is when the internal distribution cannot accurately guide the shape of the entity and the synthesized entity is distorted too much. For example, the second row shows that a bandage with a regular shape can be synthesized well, whereas an irregular moon or star on the hair is of poor quality. The second is when the head is rotated too much, in which case the synthesized entity may not be in the correct position. We believe this is because strongly rotated heads are extreme samples that constitute only a small fraction of all generated samples, so it is difficult for stochastic optimization to handle them.

A.6 Mask-Guided Transfer

In our paper, we focus on adapting GANs to new domains, so the framework needs to predict an entity mask for each latent code, which is not always accurate, as in the right plot of Fig. 5. In practice, we may wish to transfer the style and entity to only one desired image, or to provide a manually annotated mask to improve the entity's location. This can be achieved by slightly modifying our framework: if the source (content) image corresponds to a latent code w_0 and a user-provided mask m_user is available, we simply feed w_0 to the framework instead of random latents and constrain the predicted mask m_0 with ‖m_0 − m_user‖². As shown in the figure above, the location of the entity can then be controlled by the guiding mask, which is of great benefit to artistic creation in practice.
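A small sketch of the extra constraint described above; the function name and the resizing step are our own, hedged choices:

```python
import torch
import torch.nn.functional as F

def mask_guidance_loss(m_pred: torch.Tensor, m_user: torch.Tensor) -> torch.Tensor:
    """||m_0 - m_user||^2, resizing the user mask if the resolutions differ."""
    if m_pred.shape[-2:] != m_user.shape[-2:]:
        m_user = F.interpolate(m_user, size=m_pred.shape[-2:], mode='nearest')
    return F.mse_loss(m_pred, m_user)

m_pred, m_user = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 64, 64)
print(mask_guidance_loss(m_pred, m_user).item())
```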

A.8 Hyperparameter selection

A.8.4 Can the auxiliary network be placed elsewhere?

In the main paper, the auxiliary network (aux) is placed after the fourth convolutional block to receive a 32×32 feature map containing the most shape information and the least style information. Through experiments, we found that placing aux after the fifth convolutional block to receive 64×64 feature maps is an acceptable alternative, and no other placement is acceptable. As shown in Figure 11, when aux is placed after the third block, the 16×16 feature map is too small to provide accurate shape information, and the predicted feature map cannot accurately represent entities, resulting in poor entity quality. When aux is placed after the sixth block, which outputs a 128×128 feature map, the predicted feature map tends to change the hair or face style rather than produce new entities. In addition, predicting larger feature maps is more difficult and computationally more expensive. Therefore, we place aux after the fourth block in the paper.

S. Summary

S.1 Core idea

The authors argue that the adaptation from the source domain to the target domain can be divided into two parts: the transfer of global styles (textures, colors, etc.) and the generation of new entities (hats, glasses, etc.) belonging to the target domain. A binary (0/1) mask is used to assist entity generation.

The authors propose a novel and concise framework to tackle the general one-shot domain adaptation task of style and entity transfer.

The authors constrain the gap between the reference and synthetic internal distributions with the sliced Wasserstein distance (SWD); the internal patch distribution of a single image carries rich information.

The authors use variational Laplacian regularization to achieve cross-domain correspondence.

S.2 Network Architecture

The model architecture is shown in Figure 2 of the paper. The authors add an auxiliary network (aux) to the generator that predicts a binary mask (marking the entity positions) and the entity features for the reference image.

S.3 Losses

The complete loss function used in this paper is shown in Equation 3. 

 

Reconstruction loss L_rec. The reconstruction loss is expressed by Equation 4. It consists of the negative structural similarity metric between the reference image and its reconstruction in the target domain (through the latent space W), the perceptual loss lpips, and the mean square error between the reference mask and the upsampled mask reconstructed in the target domain.

  • The first term is to make images reconstructed via the same latent code w have similar structures, thus guaranteeing that the reference image has a similar structure to its reconstructed image x_rec in the source domain.
  • The second term is used to reduce the difference between the reference image and the reconstructed image.
  • The third item is used to correctly label the entity position.

Style loss L_style. The style loss is expressed by Equation 6. The sliced Wasserstein distance (SWD) is used to minimize the difference between the internal distributions of the synthesized image and the reference image. Φ_style is a set of convolutional layers from the pretrained lpips network used to extract spatial features.

The generator learns the style of the mask image ^y_rec instead of y_ref to prevent the entity's style from leaking to other regions.

Entity loss L_ent . The entity loss is expressed by Equation 7, which drives the masked reconstructed image to be similar to the generated image.

Variational Laplacian regularization L_VLapR. The variational Laplacian regularization is expressed by Equation 9.

Φ is used to extract spatial features. In Equation 9a, if w_i and w_j are close in the latent space, w_(i,j) is a large scalar, pushing the feature residuals of the two syntheses to be nearly the same. This means that the relative distances between adjacent syntheses before and after adaptation are approximately preserved in the feature space. Minimizing L_VLapR encourages the syntheses of the source and target generators G_s and G_t to have smoothly varying semantic differences across the latent space W.


Origin blog.csdn.net/qq_44681809/article/details/130970078