DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation (Paper Reading)

Nataniel Ruiz et al., Google Research, US, CVPR 2023, Cited: 218, Code, Paper

1 Introduction

Large-scale text-to-image models have made a remarkable leap in the evolution of AI, enabling the synthesis of high-quality, diverse images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and to synthesize novel renditions of them in different contexts. In this work, we propose a new approach for "personalizing" text-to-image diffusion models. Given just a few images of a subject as input, we fine-tune a pretrained text-to-image model so that it learns to bind a unique identifier to that specific subject. Once the subject is embedded in the model's output domain, the unique identifier can be used to synthesize novel photorealistic images of the subject in different scenes. By exploiting the semantic prior embedded in the model, together with a new autogenous class-specific prior-preservation loss, our technique can synthesize the subject in diverse scenes, poses, viewpoints, and lighting conditions that do not appear in the reference images, while retaining its key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. We apply our technique to several previously intractable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features.

[Figure 1]

2. Overall Idea

Suppose you want a large model to adapt to your own subject, say your dog in Figure 1: you want to place it in various locations or in various forms, and the generated dog should look like your dog, while the setting and pose still look realistic and concrete. This paper solves exactly that problem; the main tool is regularized fine-tuning.

First, you prepare a few photos of your dog (such as the four in Figure 1). Suppose you want your dog to be shown swimming. The pretrained diffusion model can generate, say, 1000 images of "a dog swimming", but obviously the dogs it generates all look different. To make them your dog, you bind your dog to a unique identifier [V]. We then use the 4 photos of your dog as the training set, with [V] added to the prompt: "a [V] dog". There is no need to add "swimming" here; we only want the model to learn this unique identifier. After training, the model associates [V] with your dog, so all those different dogs from before become your family dog.

The training process is simply this: two datasets, one generated by the model itself and one containing your own subject, and two text prompts, one "a [V] [class]" and one "a [class] ...". We train jointly with a combined loss. During training, the model learns "a [V] dog" while also relearning "a dog swimming", so the model only changes the appearance of the dog. A sketch of this setup is given below; the loss function is given later.
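To make this concrete, here is a minimal sketch of the data setup, under the assumptions stated in the comments. All names (`load_images`, `pretrained_model.generate`, etc.) are hypothetical placeholders, not the paper's actual code; it only illustrates the two-dataset, two-prompt structure described above.

```python
# Minimal sketch of the DreamBooth data setup (all names are hypothetical
# placeholders, not the paper's code).

# Dataset 1: your own subject, 3-5 photos, all labeled with the identifier prompt.
subject_images = load_images("my_dog/")
subject_prompt = "a [V] dog"

# Dataset 2: ~1000 images of the generic class, generated by the frozen
# pretrained model itself, labeled with the plain class prompt.
class_prompt = "a dog"
class_images = [pretrained_model.generate(class_prompt) for _ in range(1000)]

# The model is then fine-tuned jointly on (subject_images, subject_prompt) and
# (class_images, class_prompt), so it binds [V] to your dog while still
# remembering what ordinary dogs look like.
```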

3. Method

In this work, we propose a new approach for personalizing text-to-image diffusion models (adapting them to a user's specific image generation needs). Our goal is to expand the model's language-vision dictionary so that it binds new words to specific subjects the user wants to generate. Once the new dictionary is embedded in the model, these words can be used to synthesize novel, realistic images of the subject, placed in different scenes while preserving its key identifying features. The effect is similar to a "magic photo booth": once a few photos of the subject have been taken, the booth generates photos of the subject under different conditions and in different scenes, driven by simple and intuitive text prompts. More specifically, given a few images of a subject (about 3-5), our goal is to implant the subject (the features captured in those few images) into the output domain of the model, so that it can be synthesized with a unique identifier. To this end, we propose a technique that represents a given subject with rare token identifiers and fine-tunes a pretrained diffusion-based text-to-image framework.

We fine-tune the text-to-image model using the input images and a text prompt consisting of a unique identifier followed by the subject's class name (e.g., "a [V] dog"). The class name lets the model leverage its prior knowledge of the subject's class while binding the unique identifier to the subject. To prevent language drift, which would cause the model to associate the class name (e.g., "dog") with the specific instance, we propose an autogenous, class-specific prior-preservation loss that exploits the semantic prior on the class embedded in the model and encourages it to keep generating diverse instances of the same class as our subject. We apply our method to a variety of text-based image generation applications, including subject recontextualization, property modification, original artistic renditions, and more, opening new avenues for previously intractable tasks. We highlight the contribution of each component of our method through ablation studies and compare against alternative baselines and related work. We also conduct a user study evaluating subject and prompt fidelity of our synthesized images compared to alternative methods.

Given only a few casually captured images (typically 3-5) of a specific subject, without any textual description, our goal is to generate new images of the subject with high fidelity to its key features and with variations guided by text prompts. We do not impose any restrictions on the capture settings of the input images or on the context in which the subject appears. Next, we review background on text-to-image diffusion models (Section 3.1), then present our fine-tuning technique for binding a unique identifier to a subject described by a few images (Section 3.2), and finally propose a class-specific prior-preservation loss that lets us overcome language drift when fine-tuning (Section 3.3).

3.1 Text-to-Image Diffusion Models

A diffusion model is a probabilistic generative model that is trained to learn a data distribution by gradually denoising a variable sampled from a Gaussian distribution. Specifically, we are interested in a pretrained text-to-image diffusion model $\hat x_\theta$ that, given an initial noise map $\epsilon \sim \mathcal{N}(0, I)$ and a conditioning vector $c = \Gamma(P)$ produced by a text encoder $\Gamma$ from a text prompt $P$, generates an image $x_{gen} = \hat x_\theta(\epsilon, c)$. It is trained with a squared-error loss to denoise a variably-noised image or latent code $z_t := \alpha_t x + \sigma_t \epsilon$:
$$\mathbb{E}_{x,c,\epsilon,t}\left[\omega_t\|\hat x_\theta(\alpha_t x+\sigma_t \epsilon, c)-x\|^2_2\right] \tag{1}$$
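Written out as code, one Monte-Carlo sample of Eq. (1) looks roughly like the sketch below. This is illustrative only: `x_hat_theta` is assumed to be an x-prediction network taking the noised input, the timestep, and the text conditioning, and `alphas`, `sigmas`, and `weights` are assumed to be precomputed 1-D schedule tensors.

```python
import torch

def diffusion_loss(x_hat_theta, x0, cond, alphas, sigmas, weights):
    """One Monte-Carlo sample of Eq. (1): w_t * ||x_hat(alpha_t*x + sigma_t*eps, c) - x||^2.

    x_hat_theta is assumed to be an x-prediction network called as (z_t, t, cond).
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas), (b,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                 # eps ~ N(0, I)
    a_t = alphas[t].view(b, 1, 1, 1)
    s_t = sigmas[t].view(b, 1, 1, 1)
    z_t = a_t * x0 + s_t * eps                                 # noisy image / latent code
    x_pred = x_hat_theta(z_t, t, cond)                         # \hat x_theta(z_t, c)
    w_t = weights[t].view(b, 1, 1, 1)
    return (w_t * (x_pred - x0).pow(2)).mean()
```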

3.2 Customizing the Text-to-Image Model

Our first task is to implant the subject instance into the model's output domain, so that we can query the model for diverse novel images of the subject. A natural idea is to fine-tune the model on a few-shot dataset of the subject. Fine-tuning generative models (such as GANs) in few-shot scenarios must be handled carefully, since it can lead to overfitting and mode collapse and fail to capture the target distribution well. There has been work on techniques to avoid these problems, although, in contrast to our work, those studies mainly aim to generate images resembling the target distribution, without any requirement of subject preservation. Regarding these issues, we make an interesting observation: in a careful fine-tuning setting using the diffusion loss of Equation 1, large text-to-image diffusion models seem to excel at integrating new information into their domain without forgetting the prior or overfitting to a small set of training images.

Designing prompts for few-shot personalization: Our goal is to "implant" a new (unique identifier, subject) pair into the diffusion model's "dictionary". To avoid the overhead of writing detailed descriptions for a given image set, we opt for a simpler approach and label all input images of the subject as "a [identifier] [class noun]", where [identifier] is a unique identifier tied to the subject and [class noun] is a coarse class descriptor of the subject (e.g., cat, dog, watch, etc.). The class descriptor can be provided by the user or obtained with a classifier. We use the class descriptor in the sentence to tie the class prior to the unique identifier embedding of our subject, and we found that using a wrong class descriptor, or none at all, increases training time and language drift while degrading performance. In essence, we seek to take the model's prior knowledge of the specific class and entangle it with the embedding of our subject's unique identifier, so that we can exploit the visual prior to generate new poses and articulations of the subject in different contexts.

Rare token identifiers: We generally find existing English words (e.g., "unique", "special") suboptimal, since the model has to learn to disentangle them from their original meaning and re-entangle them with our subject. This calls for an identifier that has a weak prior in both the language model and the diffusion model. A hazardous way to do this is to pick random characters of the English alphabet and concatenate them into a rare identifier (e.g., "xxy5syt00"). In reality, the tokenizer may tokenize each letter separately, and the diffusion model's prior for these letters is strong; we find that such identifiers have similar weaknesses to common English words. Our approach is instead to find rare tokens in the vocabulary and then invert these tokens into text space, minimizing the probability that the identifier has a strong prior. We perform a rare-token lookup in the vocabulary and obtain a sequence of rare tokens $f(\hat V)$, where $f$ is the tokenizer, a function mapping character sequences to tokens, and $\hat V$ is the text decoded from the tokens $f(\hat V)$. The sequence can have variable length $k$, and we find that relatively short sequences with $k \in \{1, ..., 3\}$ work well. Then, by inverting the vocabulary on $f(\hat V)$ with the de-tokenizer, we obtain the character sequence that defines our unique identifier $\hat V$. For Imagen, we find that uniformly sampling tokens that correspond to 3 or fewer Unicode characters (without spaces), from the T5-XXL tokenizer range {5000, ..., 10000}, works well.

A tokenizer is a tool or algorithm that splits text into discrete units (such as words, subwords, or characters).
In natural language processing, tokenization is an important preprocessing step: it cuts a continuous character sequence into meaningful units for further processing and analysis.
For example, the Chinese sentence 「我喜欢学习机器学习」 ("I like studying machine learning") would be segmented into 「我 / 喜欢 / 学习 / 机器学习」.
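As a rough illustration of both points (letter-by-letter splitting of "random character" identifiers, and rare-token lookup), here is a sketch using the Hugging Face `t5-small` tokenizer as a stand-in for the T5-XXL tokenizer. The exact token splits, the vocabulary range, and the paper's additional filter for tokens of 3 or fewer Unicode characters are only approximated here.

```python
import random
from transformers import AutoTokenizer

# t5-small is used as a stand-in for the T5-XXL tokenizer mentioned above.
tok = AutoTokenizer.from_pretrained("t5-small")

# A "random characters" identifier tends to be split into several common
# sub-tokens, each of which the model already has a strong prior for.
print(tok.tokenize("xxy5syt00"))   # e.g. a list of single-character pieces

# Rare-token lookup: sample token ids from a low-frequency range of the
# vocabulary (the paper reports {5000, ..., 10000} for T5-XXL) and decode
# them back to text to get a short identifier with a weak prior.
# (The paper additionally filters for tokens of <= 3 Unicode characters
# without spaces; that filter is omitted in this sketch.)
candidate_ids = random.sample(range(5000, 10000), k=3)
identifier = tok.decode(candidate_ids)
print(identifier)                  # a short, rare character sequence
```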

3.3 Class-Specific Prior Preserving Loss

In our experience, fine-tuning all layers of the model achieves the best subject fidelity. This includes fine-tuning layers conditioned on the text embeddings, which raises the problem of language drift. Language drift is a phenomenon observed in language models: a model pretrained on a large text corpus and then fine-tuned on a specific task gradually loses syntactic and semantic knowledge of the language. To our knowledge, we are the first to find a similar phenomenon affecting diffusion models, where the model progressively forgets how to generate subjects of the same class as the target subject.

Another problem is reduced output diversity. Text-to-image diffusion models naturally have high output diversity. When fine-tuning on a small set of images, we would like to still be able to generate the subject in new viewpoints, poses, and articulations. However, there is a risk of reducing the variability of output poses and views (e.g., snapping to the few views seen in training). We observe that this often happens, especially when the model is trained for too long.

To mitigate these two problems, we propose an autogenous, class-specific prior-preservation loss that encourages diversity and counters language drift. In essence, our approach is to supervise the model with its own generated samples, so that the class prior is retained once few-shot fine-tuning begins. This lets the model keep generating diverse images of the class prior and retain knowledge about the class that can be combined with knowledge about the subject instance. Specifically, we generate data $x_{pr} = \hat x(z_{t_1}, c_{pr})$ by running the ancestral sampler on the frozen pretrained diffusion model with random initial noise $z_{t_1} \sim \mathcal{N}(0, I)$ and conditioning vector $c_{pr} := \Gamma(f(\text{"a [class noun]"}))$, which leads to the following loss:
$$\mathbb{E}_{x,c,\epsilon,\epsilon',t}\left[\omega_t\|\hat x_\theta(\alpha_t x+\sigma_t \epsilon, c)-x\|^2_2+\lambda\,\omega_{t'}\|\hat x_\theta(\alpha_{t'} x_{pr}+\sigma_{t'}\epsilon', c_{pr})-x_{pr}\|^2_2\right] \tag{2}$$
where the second term is the prior-preservation term, which supervises the model with images generated by the frozen model itself, and λ controls its relative weight. Figure 3 illustrates the fine-tuning process with class-generated samples and the prior-preservation loss. Despite its simplicity, we find this prior-preservation loss very effective at promoting output diversity and overcoming language drift. We also find that it allows training for more iterations without overfitting. We find that λ = 1, a learning rate of $10^{-5}$ for Imagen and $5 \times 10^{-6}$ for Stable Diffusion, about 1000 iterations, and a subject dataset of 3-5 images are sufficient to obtain good results. During this process, about 1000 "a [class noun]" samples are generated, but fewer could be used. The training process takes about 5 minutes on a TPUv4 for Imagen, and about 5 minutes on an NVIDIA A100 for Stable Diffusion.
[Figure 3: fine-tuning with class-generated samples and the prior-preservation loss]
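A minimal sketch of one training step with this prior-preservation term, reusing the hypothetical `diffusion_loss` from the Eq. (1) sketch above; `frozen_model`, `encode_prompt`, `sample_batch`, `model`, and `optimizer` are placeholders rather than the paper's actual implementation.

```python
# Sketch of one fine-tuning step with the prior-preservation term of Eq. (2).
# All object names are hypothetical placeholders.

lambda_pp = 1.0                                   # lambda = 1, as reported above

# Generate the prior-preservation images once with the frozen pretrained model.
c_pr = encode_prompt("a dog")                     # c_pr = Gamma(f("a [class noun]"))
prior_images = frozen_model.sample(c_pr, n=1000)  # ancestral sampling from random noise

c_subj = encode_prompt("a [V] dog")

for step in range(1000):                          # ~1000 iterations, as reported above
    x = sample_batch(subject_images)              # batch from the 3-5 subject photos
    x_pr = sample_batch(prior_images)             # batch of self-generated class images
    loss = diffusion_loss(model, x, c_subj, alphas, sigmas, weights) \
         + lambda_pp * diffusion_loss(model, x_pr, c_pr, alphas, sigmas, weights)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```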

4. Experiments

In this section, we present experiments and applications. Our method enables a wide range of text-guided semantic modifications of subject instances, including recontextualization, modification of subject properties (such as material and species), artistic rendition, and viewpoint modification.

Importantly, across all these modifications we are able to preserve the unique visual features that give the subject its identity and essence. For recontextualization, the subject's features remain unchanged, but the appearance (e.g., pose) may change. For stronger semantic modifications, such as a cross between our subject and another species or object, key features of the subject are preserved after the modification. In this section we use [V] to refer to the subject's unique identifier.

We collected a dataset of 30 subjects, including unique objects and pets such as backpacks, stuffed animals, dogs, cats, sunglasses, cartoons, etc. We also collected 25 prompts per subject: for objects, 20 recontextualization prompts and 5 property modification prompts; for live subjects/pets, 10 recontextualization prompts, 10 accessorization prompts, and 5 property modification prompts. For the evaluation suite, we generated four images per subject and per prompt, for a total of 3,000 images (30 subjects × 25 prompts × 4 images). This allows us to robustly measure the performance and generalization ability of our method.
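As a sanity check of the evaluation scale, a sketch of the generation grid (all names are hypothetical placeholders):

```python
# 30 subjects x 25 prompts x 4 samples = 3000 generated images.
images = []
for subject in subjects:             # 30 subjects
    for prompt in prompts[subject]:  # 25 prompts per subject (object or pet set)
        for seed in range(4):        # 4 images per (subject, prompt) pair
            images.append(generate(model, subject, prompt, seed=seed))
assert len(subjects) * 25 * 4 == 3000  # holds when len(subjects) == 30
```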

4.1 Ablation Experiments

Prior-preservation loss (PPL) ablation: We fine-tune Imagen models on 15 subjects from our dataset, with and without our proposed prior-preservation loss (PPL). The prior-preservation loss is intended to counter language drift and preserve the class prior. We compute a prior-preservation metric (PRES) as the pairwise similarity between the average DINO embeddings of images generated for random subjects of the class and those of the real images of our specific subject. A higher value means that random subjects of the class look more like our specific subject, indicating a collapse of the prior. We compute a diversity metric (DIV) as the average LPIPS cosine similarity between images generated for the same subject with the same prompt. We observe that the model trained with PPL achieves higher diversity (with a slight drop in subject fidelity), which can also be seen qualitatively in Figure 5: the model trained with PPL overfits less to the environment of the reference images and can generate the dog in more varied poses and expressions.
[Figure 5: qualitative comparison of models trained with and without the prior-preservation loss]
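A rough sketch of how such metrics could be computed with off-the-shelf DINO and LPIPS models follows. This is an approximation of the PRES/DIV metrics described above, not the paper's exact evaluation code; image resizing and normalization are assumed to be handled by the caller.

```python
import itertools
import torch
import torch.nn.functional as F
import lpips                                   # pip install lpips

# DINO ViT-S/16 from torch.hub; calling it on an ImageNet-normalized batch
# returns the [CLS] feature embeddings.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
lpips_fn = lpips.LPIPS(net="vgg").eval()       # expects images scaled to [-1, 1]

@torch.no_grad()
def dino_embed(batch):                         # batch: (N, 3, 224, 224)
    return F.normalize(dino(batch), dim=-1)

@torch.no_grad()
def pres_score(random_subject_gens, real_subject_images):
    # PRES-like score: cosine similarity between mean DINO embeddings of images
    # generated for a random subject of the class and the real subject images.
    a = dino_embed(random_subject_gens).mean(0)
    b = dino_embed(real_subject_images).mean(0)
    return F.cosine_similarity(a, b, dim=0).item()

@torch.no_grad()
def div_score(gens):
    # DIV-like score: average pairwise LPIPS between images generated for the
    # same subject with the same prompt (inputs scaled to [-1, 1]).
    dists = [lpips_fn(gens[i:i + 1], gens[j:j + 1]).item()
             for i, j in itertools.combinations(range(len(gens)), 2)]
    return sum(dists) / len(dists)
```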
Class-prior ablation: We fine-tune Imagen on a subset of our dataset (5 subjects), training with no class noun, with a randomly sampled incorrect class noun, and with the correct class noun. With the correct class noun, we can faithfully fit the subject and leverage the class prior to generate it in various contexts. With an incorrect class noun (e.g., "can" for a backpack), the subject clashes with the class prior, sometimes yielding cylindrical backpacks or otherwise incorrectly shaped subjects. If we train without a class noun, the model cannot exploit the class prior, struggles to learn the subject and to converge, and can generate erroneous samples. Subject fidelity results are shown in Table 4; our proposed approach achieves significantly higher subject fidelity.

Limitations: We show some failure cases of our method in Figure 8. The first is related to not being able to accurately generate the prompted context. Possible causes are a weak prior for these contexts, or difficulty in generating the subject and the specified concept together due to a low probability of co-occurrence in the training data. The second is context-appearance entanglement, where the appearance of the subject changes because of the prompted context, exemplified in Figure 8 by color changes of the backpack. Third, we also observe overfitting to the real images when the prompt is similar to the original setting in which the subject was seen. Other limitations are that some subjects are easier to learn than others (e.g., dogs and cats); occasionally, for rarer subjects, the model fails to support as many subject variations. Finally, there is also variability in subject fidelity, and some generated images might contain hallucinated subject features, depending on the strength of the model prior and the complexity of the semantic modification.


Origin blog.csdn.net/qq_43800752/article/details/131057905